Chapter 1. Doctor’s Black Bag

In C. M. Kornbluth’s short story “The Little Black Bag,”¹ Dr. Full, a struggling physician, discovers a mysterious doctor’s bag from the future filled with advanced medical devices and medicines. This futuristic equipment enhances his abilities as a doctor, allowing him to diagnose and treat patients with unprecedented efficiency. As he explores the bag’s contents, Dr. Full finds himself marveling at the potential of future medical innovations, feeling as though he’s standing on the cusp of a healthcare revolution.

However, Kornbluth’s tale is not simply a celebration of medical progress. As the story unfolds, it becomes clear that such powerful technology, in the wrong hands, can be exploited for personal gain rather than the greater good. The narrative serves as a reminder that while future medical advancements may seem magical in their capabilities, they also come with significant ethical responsibilities and potential for misuse. This fictional account prompts us to consider both the exciting possibilities and the ethical challenges that lie ahead as medical technology and AI (LLMs and generative AI) continue to advance rapidly, often in interconnected ways.

AI and LLMs have the potential to enhance clinical care, especially in those areas that augment clinicians’ decision-making processes, enhance patient-doctor interactions, streamline administrative tasks, improve patient education and engagement, and ultimately lead to better health outcomes. By leveraging the capabilities of multimodal LLMs, healthcare institutions have the potential to develop sophisticated virtual medical assistants that can proactively monitor patient health and aid in diagnosis, according to a senior medical executive at NewYork-Presbyterian, a prominent hospital network in New York.²

Just as the traditional black bag equipped physicians with essential tools to provide quality care, AI is poised to become an indispensable asset empowering clinicians to deliver more personalized, efficient, and evidence-based care to their patients. This chapter explores the possibilities for improved patient care using LLMs—especially where they might improve healthcare where clinicians and patients benefit the most.

LLMs are natural language processing (NLP) machine learning models that can seemingly understand³ and generate human language text. LLMs are a type of artificial intelligence (AI) that comprehends and manipulates human language with remarkable proficiency. They are called “large” because they are trained on vast amounts of text data, often billions of words, which enables them to learn the nuances of human language.

For clinicians, LLMs can be thought of as advanced language processing tools that can assist with a variety of administrative tasks involving healthcare data (structured like electronic health records [EHRs] or unstructured doctor notes). Just as stethoscopes and X-ray machines extend a clinician’s abilities to assess a patient’s health, LLMs can enhance a clinician’s capacity to analyze and interpret large amounts of research data, email threads with embedded videos, a patient’s historical health records, clinical notes, discharge summaries, and more.

Generative AI is a subset or type of AI, just as LLMs and machine learning are types of AI. Generative AI is focused on creating new content such as text, images, video, or audio often in response to a user’s questions. The generated outputs often resemble human created content in terms of style and structure.

When we use phrases such as LLMs or generative AI in this book, we do so as catch-all terms that encompass a wide range of AI systems, even if they have different attributes or employ different machine learning algorithms. These catch-all terms include but are not limited to LLMs, small language models, multimodal models, and generative AI.

Think of these models as having been trained on lots of our written information (e.g., books, articles, and websites) and on tons of subjects in the world. This allows them to understand something about the relationships among words in sentences and paragraphs, the meanings that accumulate upon chapters, the overall progressions exhibited by narrative arcs, and so on.

When OpenAI released its language model ChatGPT in 2022, it transformed conversational AI and liberated NLP, opening up human-sounding conversation and question-answering for the masses with an easy web interface. This LLM gained worldwide popularity for its ability to hold human-like conversations, answer questions, ghostwriting essays and perform a wide range of tasks. It sparked interest from CEOs with technology companies about the technology impact on business and everyday life. But let’s remind ourselves of how we should think about using LLMs.

A 2023 blog⁴ discusses ChatGPT experiments providing insights on how we should think about and use LLMS. If we treat LLMs as summarization tools and treat their prompts not as commands to another sentient being but as anchors for that process where summarization is attuned to something in the real world, we can use these tools effectively in healthcare. We did something similar with the laconic constraints of keyword search: we learned and are still learning how to “steer” search toward anything we want. Since the advent of search some decades ago, we have learned how to formulate questions in linguistic anchors that are likely to lead their search toward what they want. We can do the same for LLM prompts, treating them as anchors for summarization in preference to specifications and thereby focusing it well. Once we treat LLM prompts as anchors to summarization, we can steer them more effectively by embedding the summarization task in the model’s knowledge base, on the one hand, and in the task scope defined by the prompt, on the other.

As we begin to better understand how LLMs work, the hyperbole needs to be balanced with realistic assessment of how to build and leverage LLMs in healthcare. LLMs are statistical, autoregressive models, a class of machine learning models that predict the next word from the context. For example, let’s suppose you are an author writing a story. You have a writing assistant that offers you the next appropriate word based on the words you have already written. This assistant has read many stories, of all different types, so it knows approximately how words follow other words in order to comprise generally useful sentences and more complex narratives. With the story-writing assistant, you keep writing, keep accepting the suggestions, and from word to word, one after the other, your story grows longer. With each step, the next word follows from the words written before. That, in essence, is how an autoregressive model works: it learns from existing data and generates (or predicts) new data, one step at a time, based on the previous sequence of data.

Gemini, ChatGPT, Claude.AI, and other autoregressive LLMs provide the illusion of reasoning the way people do, providing amazing responses to nuanced or complicated prompts. They even seem to act like people providing seemingly emotional reactions and empathetic understanding. These illusions are made more believable because of our cognitive biases, that is, our tendency to anthropomorphize. The next chapter, Chapter 2, discusses the workings of LLMs more specifically, including in explicit details the tokens, parameters, etc.

Potential of LLMs and Generative AI

While existing medical LLMs are already impressive and useful in some ways, development is in the early stages, and these innovations have achieved only a fraction of the potential to transform how we deliver healthcare. Current developments emphasize reducing clinician burden and documentation—but this is still only the beginning of LLM impacts on healthcare delivery. There are already entire medical-specific versions published, but they’re not yet making a splash in how clinical care gets delivered for a variety of reasons.

Data availability and quality: LLMs are trained on huge datasets, and their performance will depend on the quality of the data used to train them. In medicine, data is often distributed across multiple sources, such as EHRs, medical journals, and randomized clinical trials. Moreover, data needs to be complete, accurate, and consistent; data of lower quality can impact how LLMs perform.
Bias and fairness: LLMs learn from biased data, meaning they are trained on data that reflects the biases of the world. This can lead to transferring and reinforcing already potent biases in how we deliver care to certain patient groups. For example, an LLM that’s trained on a biased dataset of medical records (e.g., datasets that contain low numbers of specific racial or ethnic groups) can also generate biased recommendations. When discussing bias, the focus is on the system working properly as intended. Mitigating bias is often critical to system functionality and successful usage.
Interpretability and explainability: Achieving interpretability or explainability has been one of the most significant challenges in AI development. LLMs are often considered “black boxes” due to their complex inner workings, which operate in such an obscure manner that their overall functionality and output are difficult to understand or predict. The lack of interpretability and explainability is problematic in medicine. There will be significant reluctance to adopt LLMs in healthcare if we cannot understand how they arrive at specific recommendations or diagnoses.
Regulatory landscape: A key challenge surrounding AI development in medicine is the regulatory void. There’s currently no clear guidelines to define what it means to develop and deploy LLMs in healthcare. This uncertainty effectively dampens enthusiasm in trying to use AI in medically complex and high-stakes contexts since there’s little precedent to guide healthcare organizations on how to act under these conditions.
Ethical landscape: In general, concerns about using LLMs in medicine include the potential for misuse, erosion of patient autonomy, and invasion of patient privacy. Ethical concerns must be considered before LLMs can be used in healthcare.

LLMs explicitly model relations between words and meaning over a long stretch of text, resulting in more fluent textual understanding and generation of content. In addition, they differ from more primitive language models, which can only concatenate words into a text pattern due to their scale of data and the parameterization of models. Chapter 2 details how LLMs work, explaining parameters, tokens, and more.

Existing medical and other LLMs are, in many ways, already pretty impressive. We’re just in the very early stages of the maturation of these algorithms and models, but the potential for a transformation of healthcare delivery is tremendous. Still, much of the current focus is on decreasing administrative and documentation burdens to clinicians, and that’s just the start of how things might change. The current generation of LLMs may soon feel fairly primitive. There will be many more awe-inspiring and transformative applications in the years as LLMs and other kinds of AI grow and improve.

There are several LLMs and other AI platforms that have been specifically created for applications in medicine and healthcare, some of which are research prototypes and others that are more mature and used in real-world applications. Here are a few prominent examples.

PubMedBERT: This LLM⁵ and its associated model had pretraining on biomedical text, and the researchers state that it outperforms all prior language models. It’s designed to excel in the biomedical domain. It is trained on a large amount of biomedical research papers from the PubMed⁶ database. It uses BERT, an NLP model developed by Google. It is designed to help computers better understand and interpret human language in a way that considers the context and relationships between words. BERT can understand the meaning of a word based on the words that come before and after it in a sentence or paragraph. BERT has revolutionized the field of NLP and has been widely adopted in various applications, such as search engines, chatbots, and sentiment analysis tools. Its ability to understand and interpret human language has significant implications for improving human-computer interaction and enabling more accurate and efficient processing of large volumes of text data.
BioBERT: This is a specialized language model adapted for biomedical text. It builds upon the original BERT model,⁷ which was trained on a general text corpora. BioBERT is further trained on biomedical literature, enhancing its ability to understand and process medical and scientific language. It is pretrained on very large-scale corpora from the biomedical domain—particularly, a combination of PubMed abstracts and PubMed Central (PMC) full-text articles from the US National Library of Medicine.
SciBERT: Designed by the Allen Institute for AI, this BERT-derived model⁸ is trained on a large corpus of scientific text from domains, including biomedical and computer science literature, and it has been applied to tasks such as scientific document abstraction.
ClinicalBERT: Designed to learn the domain-specific language of clinical text and its distinctive structure, such as human-sounding paraphrases, ClinicalBERT⁹ is a domain-specific LLM based on clinical notes from the MIMIC-III database,¹⁰ and it trains to carry out tasks such as clinical named entity recognition, relation extraction, and sentiment analysis.
Med-PaLM from Google: Google’s Pathways Language Model (PaLM) has been fine-tuned on medical knowledge to create Med-PaLM,¹¹ which scored high on various medical benchmarks, including tasks such as answering medical exam questions and providing clinical decision support. Google also announced Med-PaLM2,¹² which reached human expert level on answering U.S. Medical Licensing Examination (USMLE) type questions.

Whether using the domain-specific examples of LLMs previously mentioned or coupling multiple core models such as GPT-4 (OpenAI), Claude 3 family (Anthropic), Gemini (Google), or LLaMA 2 (Meta) coupled with a company’s proprietary data, LLMs will change the healthcare industry for the better. These AI models make navigating, finding, and understanding content on a health plan or payer’s website easier. The models accelerate medical research by analyzing large datasets from EHRs, clinical trials, and scientific literature. Recent advances of LLMs offer the ability for a doctor or researcher to have an LLM read a short or lengthy email or set of emails, many of which may include video or audio clips and summaries for the clinician.

Moreover, LLMs are addressing various challenges in healthcare, such as deciphering and cleaning up medical notes. They also bridge patient-provider communication gaps through conversational AI, ensure thorough understanding of patient histories before treatment, and analyze healthcare data from various sources to gain better patient insights. As LLMs continue to evolve and integrate into the healthcare system, their impact is expected to be transformative, shaping the future of patient care, research, and communication.

They still have much ground to cover to consistently surpass the most skilled medical professionals’ expertise. Still, there is enormous potential for integrating LLMs as a third element in the doctor-patient relationship. LLMs could be useful for assisting in diagnosis, documentation, and patient communication.

Every clinical and administrative healthcare process that requires humans to create original work with the data extracted from medical coding, patient education, diagnosis, patient intake, treatment planning, medication management, etc., is up for reinvention.

LLM and generative AI applications or apps are starting to take off, thanks to the maturation of the platform layer, the continuous improvement of models, and the increasing availability of free and open source models. This gives developers, startups, and enterprises the tools they need to build innovative applications. As mobile devices spawned new types of applications with new capabilities like sensors, cameras, and on-the-go connectivity, LLMs are poised to usher in a new wave of generative AI applications and devices for healthcare.

For instance, Perplexity AI demonstrates how LLMs can be leveraged to create powerful, user-friendly interfaces for information retrieval and analysis. Perplexity is a popular AI-powered search and chat platform that utilizes existing LLMs (such as GPT-3.5 and GPT-4) rather than being a distinct LLM itself. As an AI application or interface built on top of these models, Perplexity showcases how developers can create innovative tools by harnessing the capabilities of advanced LLMs. While not specifically focused on healthcare, such applications illustrate the potential for LLM-driven tools to revolutionize information access and decision support across various fields, including medicine.

These days, there is no shortage of healthcare gadgets to help us optimize our lives. A search of the internet shows a vast array of wearable medical devices used in healthcare¹³—including blood pressure monitors, glucose meters, ECG monitors, fitness trackers, and more. We like to attach these devices to ourselves to make our lives healthier and our working conditions easier. These devices will become more useful as health LLMs advance. For example, LLMs could integrate data from multiple sources—such as your Fitbit, diet app, exercise app, fasting app, and sleep tracker—to provide a more holistic view of your health. They could then analyze this combined data to identify patterns and trends that might not be evident when looking at each data source in isolation.

Making internet searches remains a common research method for self-diagnosis. Researching one’s symptoms in this manner is often colloquially referred to as “consulting Dr. Google.” However, the evidence is clear that internet searches correlate to small increases in diagnostic accuracy and almost none in triage accuracy.¹⁴ LLMs will change this equation as internet search and LLMs integrate over the next few years. The question-and-answer interaction model of leading LLMs directs you to a conversation-like exchange with the internet—one that’s context-sensitive and generative.

Let’s explore the current differences between, say, a Google search and leading LLMs question-and-answer prompt.

Interaction style: A Google search is an initiating-response style, whereas leading LLMs employ a question-and-answer style, where you ask it questions in natural language and it responds with a specific answer. A Google search typically returns many responses to your search terms and phrases. It is a relatively flexible system because it returns all matching results while ranking them. A Google search also cites sources.
Information sources: When you do a search on Google, the system taps into the internet’s humongous index of web pages and other content in order to find what may match your search request and load it. In contrast, leading LLMs tap into information sources that it was trained on, comprising a corpus of text data leading up to a specific date, which often has a lag between that date and the current date of the user’s prompt.
Specificity of answers: When you search on Google, you will likely see a range of web pages, articles, and other resources you need to browse and scroll through in order to find the specific information you are looking for. Leading LLMs attempt to provide you with specific answers—directly relevant to what you are looking for—without requiring you to do all the searching. This of course may also result in hallucinations, which is another way of saying AI is generating nonsensical outputs, or simply providing factually wrong information. We should remind the reader that by saying AI or leading LLMs hallucinates, we anthropomorphize AI. As a nonhuman object which lacks many human qualities, an AI is not capable of experiencing literal hallucinations.
Making new things: A Google search is a method of finding information on the web that has already been made. An LLM can not only find that information but also analyze, make new text, and explain or argue some conclusion.

LLM-powered chatbots will answer our questions about our health, and LLM-powered diagnostic tools will help doctors diagnose diseases more accurately. Clinicians will engage medical LLMs to develop personalized treatment plans and monitor patients’ progress.

LLMs will revolutionize how consumers and patients navigate their health and healthcare systems. By providing personalized insights, recommendations, and support, LLMs can enable patients and consumers to assume greater responsibility for their well-being and make well-informed choices regarding their healthcare.

LLMs could revolutionize healthcare in several ways:

Personalized health education: LLMs provide consumers and patients with customized education about their health conditions, treatment options, and prevention strategies. Generative AI can be used to create personalized educational videos allowing clinicians to tailor the education to the individual’s specific needs, language, and preferences.
Medical decision support: Consumers and patients using an LLM chatbot app can assist in helping with educated choices concerning their healthcare. The chatbot can perform product or plan comparisons of various treatment alternatives, and explain the advantages and disadvantages of each option using a variety of modalities like video. This would occur with dispensing medical advice or clinical advice as the chatbot would only organize and summarize data and content already readily available and provided to the patient. The chatbot operates as a tool for understanding the content.
Navigation assistance: LLMs help consumers and patients navigate the complex healthcare system to find qualified providers, schedule appointments, and understand insurance coverage. Using a chatbot to scour the internet (e.g., ratemds.com, vitals.com, healthgrades.com, or Yelp) to find and summarize patient reviews of a provider or specific clinician could be critical to one’s health. Although these reviews are subjective, a tool like a chatbot, which summarizes such data, allows a consumer to make a more informed choice.
Emotional support: LLMs support the emotional health of consumers and patients. LLMs can listen to concerns, offer encouragement, and connect patients with others facing similar challenges. The conversational nature promoted by LLMs offer an opportunity for a dialogue that supports and empowers consumers and patients.

LLMs will change the current personalization landscape that patients and consumers face, leading to increased personalization of healthcare. This will include coaching support, providing more individualized information and recommendations. LLMs could empower patients and consumers to be more responsible for their own health and well-being by making well-informed decisions about their health.

The choice of an LLM-powered chatbot lies in the power of conversational AI enabled by LLMs.¹⁵ A few examples of LLM-powered chatbots in practice might include any of the following:

A person with chronic health conditions might use a chatbot with LLM-powered capability to track and document symptoms, help manage prescriptions, and provide tailored information on living well.
A terminally ill patient facing a choice—to have surgery or not—might use an LLM chatbot to quantify the risks and benefits of each option and get advice tailored to her own level of risk aversion as input to a conversation with her doctor.
A caregiver for a patient with a chronic health condition, for example, might use an LLM-powered chatbot to coordinate appointments and care among several providers, provide explanations and context for what providers have said, and help navigate decisions.

In each of these examples, an LLM-powered chatbot offers certain benefits over a generic machine learning approach.

Natural language understanding: We’ve already mentioned how adept LLMs are at natural language understanding, which in turn empowers human-sounding natural language input (e.g., asking questions about symptoms in our own words), which is more intuitive and accessible compared with filling out a structured form or keyword-based search.
Contextual awareness: LLMs can hold context during the course of a conversation and understand how pieces of information relate to each other, allowing the chatbot to provide more informative and less repetitive or meandering answers to the patient’s questions. The bot can then track how its answers have changed over the course of its interaction with the patient, based on the context from the patient’s initial statements describing their symptoms, prescribed medications, and lifestyle factors.
Personalized support: By having a dialogue with the user and learning about their specific situations and problems, an LLM-powered chatbot could offer helpful advice and suggestions tailored to an individual’s health condition as well as their treatment plan and their lifestyles, which could be more meaningful and useful.
Prognostic support: LLMs can derive signals from the stream of information a user provides over time, and synthesize these insights into what trends have occurred in the patient’s records. Armed with all this data, the chatbot could, for example, flag an issue to the user or automatically refer them to helpful resources or preventive care that the algorithm thinks they could benefit from, with the goal of ultimately enhancing health outcomes.
Emotionally intelligent support: LLMs can be trained to communicate in an understanding, supportive voice. Individuals who struggle with the daily challenges of chronic illness can benefit from having a supportive conversational partner to keep them motivated and maintain their mental health.
Scalability: Most ML models require explicit training in the nuances of each new task or capability that you want them to perform. However, because LLMs can effectively adapt their already general knowledge of language to support even quite different kinds of tasks on quite different topic domains, it becomes easier to scale and adapt the chatbot to handle more diverse user needs and to expand the breadth and depth of its knowledge base over time.

Alternatively, a bespoke ML model could draw on less structured data inputs and simpler logic, and could not offer much support or broader context. This, in turn, might mean less customization for the user, a shallower and narrower scope for support, and more work over time to build and maintain.

For example, an LLM-driven bot does not require a definition of mental pain or a functional classification. This suggests that LLMs can operate without rigid, predefined categories or definitions, allowing for more flexible and nuanced interactions. However, it would nonetheless bring to bear its own capabilities—including some of the advantages of natural-language interaction, contextual understanding, and “deep” knowledge integration—to deliver a type of broad, tailored support that seems promising in helping people with chronic health conditions.

LLMs will help to equalize access to healthcare. Patients and consumers will be able to seek and access higher-quality health information and advice. Healthcare professionals and the broader health system will be better able to help realize patients’ potential for wellness.

The point is not simply that LLMs can do things for us. We can use LLMs and generative AI to enable us to be healthier and happier. In the next section, we will sketch some of these future apps or applications.

Promise and Possibilities of LLMs in Healthcare

Eight million people are dying every year worldwide who would have lived if they had better access to healthcare.¹⁶ Medicine and healthcare are on the cusp of a tsunami of change as LLMs and generative AI fundamentally transform medicine. LLMs trained on large healthcare and medical data linked to cutting-edge breakthroughs in artificial intelligence will facilitate a personalized form of healthcare.

Coupling the knowledge captured in their training corpus with data from a patient’s chart has the potential to dramatically advance clinical decision support systems, and ultimately improve care and outcomes for patients. LLMs can help physicians make more precise diagnoses, identify the best treatment, and even predict patient prognosis.

In a future where LLMs have been embedded in clinical decision support systems, doctors might have access to an almost inexhaustible supply of medical knowledge at the point of care. With such tools, physicians might be able to reduce medical mistakes. LLMs created to help clinicians who might themselves hover on the verge of a mistake get steered away from danger.

LLMs could provide clinicians with a degree of real-time care that has never existed before, by tracking medical notes in EHRs, data from home-based devices, and patient-entered information on digital platforms. This approach could create an early-warning system of symptoms, signs, and laboratory test results suggesting worsening illness. By identifying health problems early on, LLMs offer a great opportunity in helping to prevent the onset of chronic diseases, which can adversely affect patients’ health-related quality of life and often lead to a high financial burden for the healthcare system.

Beyond that, LLM-derived insights can inform precision health approaches, aiming to optimize primary, secondary, and tertiary prevention, as well as treatment interventions according to each individual patient’s genetic, environmental, and lifestyle characteristics. As a result, precision healthcare could be crafted to optimize treatment responses, increase patient engagement and regimens adherence, and improve health outcomes, as much for individuals as for populations.

Given the rapid progress of LLMs, as well as their inevitable integration into other disruptive technologies in the future, the potential of AI to develop a truly predictive, preemptive, and individualized system of healthcare is multiplied. Big data for healthcare and AI/LLM holds the promise of making preventive and preemptive medicine the new normal.

As new AI features such as agentic reasoning, retrieval-augmented generation, and bigger prompts are introduced into LLMs, the value of LLMs increases. They can handle more pointed and more nuanced queries based on more medical and other domain-specific “data”; reason about hypothetical scenarios; and reply to questions with replies that seem context-aware and personalized. Today, clinicians may use one of several apps for medical questions, like the UpToDate app.¹⁷ The adoption of LLMs can improve the functionality of such apps in areas of search, summarization, user interface, and more.

Imagine two healthcare scenarios, each harnessing the power of apps driven by LLMs and generative AI. These AI cutting-edge apps seamlessly integrate conversational AI, advanced search functionality, and intelligent summarization capabilities, revolutionizing the way patients and consumers interact with technology and access information. Let’s delve into these hypothetical scenarios and explore the potential impact of these AI-powered apps in healthcare.

In the first scenario, a Medical Swiss Army Knife is the name of a consumer app focused on helping patients and consumers as they engage and navigate the healthcare system. In the second scenario, a Medical Sherpa, is the name of a clinician app designed to be a companion or virtual assistant to a clinician. In both scenarios, the LLMs are trained or augmented with trusted knowledge sources, clinical data, pharmacy data, EHR data, and more.

Medical Swiss Army Knife App for Consumers

An AI startup company introduces a novel chatbot, built using a medical-specific LLM. The chatbot is a healthcare app called Medical Swiss Army Knife, which orchestrates multifunctional capabilities for consumers or patients in healthcare contexts such as scheduling doctor appointments, summarizing a patient’s history, and listening to doctor and patient dialogue in order to provide plain and intelligible summarization of doctor instructions. Medical Swiss Army Knife also offers provider steerage to help users navigate and identify the probable optimal provider for their medical condition.

David, a 75-year-old man, is in love with his Fitbit wearable. Over several weeks, he repeatedly receives a signal detecting atrial fibrillation (AFib) and contacts his doctor, who refers him to a cardiologist. David takes medication for high blood pressure and statins to control his cholesterol. David recently had a calcium scoring test showing him in a high-risk category. His cardiologist recommends and performs an AFib ablation, but it does not fix the problem. David is rechecked into the hospital to give his heart a controlled electric shock and cardioversion to restore a normal rhythm, but to no avail.

David wonders if there is an alternative to the AFib ablation that he should consider. He talks with his doctor, who advises him that the AFib has good results and they should try again, as this hospital specializes in this procedure to treat AFib. David has the Medical Swiss Army Knife app on his iPhone, a recommended download from his wife Ann, and decides to use it to research his question on AFib ablation alternatives. The Medical Swiss Army Knife app uses a medical-specific LLM, a foundational LLM like Google Gemini, combined with data from David’s medical records, medical history, and health information. The app informs David of another procedure, a catheter ablation. Showing David verified videos of a preeminent research hospital and physician specializing in this procedure. David is intrigued and consults with his physician, who advises him that this is an alternative treatment that he cannot provide and that David should contact the research hospital to learn more.

The app starts a conversation with David about his calcium scoring test showing him at high risk. It informs David that a cardiac computed tomography (CT) scan would most likely be performed at the research center before the catheter ablation to help his attending physician anticipate potential difficulties during the procedure.

David uses the Medical Swiss Army Knife app to contact the hospital and make the initial phone appointment to learn more. David enjoys the conversation and finds himself enlightened, deciding to pursue treatment at this research hospital. The app makes the appointment, flight, and hotel reservations. David engages in a conversation with the Medical Swiss Army Knife app to better understand what questions he should be asking. The app suggests that David ask the following:

What is the best treatment plan for me, given my circumstances?
What are the different treatment options available, and what are the risks and benefits of each?
How does my AFib affect my heart?
What is my risk of stroke?
What should I do if I have an AFib episode?
What are the long-term implications of living with AFib?

The app development, by a reputable company, employs state-of-the-art security measures to protect patient privacy. The app design attempts to avoid misinterpretation of the conversation or providing inaccurate information by:

Using a large and diverse dataset to train the LLM: This dataset includes medical conversations. This helps the LLM learn the nuances of medical language and avoid making mistakes.
Using state-of-the-art NLP techniques: These NLP methods are used to comprehend the conversation effectively. This, in turn, assists the LLM in pinpointing the essential aspects of the discourse and refraining from drawing unsupported inferences.
Incorporating feedback from doctors and patients: This app improves the accuracy of the LLM. The app’s continuous feedback loop helps identify areas where the LLM is struggling and make necessary adjustments.
Providing transparency to users: The app allows users to find out about how it works, and it uses their data to help users understand the app’s limitations and use it responsibly.

The LLM Medical Swiss Army Knife app reminds David that it is not a doctor and cannot provide clinical advice or diagnoses. It informs David that he should seek medical advice from his AFib doctor before making decisions about his care. David and his wife fly 2,000 miles and check in to the recommended hotel adjacent to the hospital. Both are immediately impressed as the cardiologist phones and asks if he can stop by and say hello. This personal service is beyond their expectations. Before the doctor’s meeting, David opens the Medical Swiss Army Knife app to check on the questions he wants to ask. The app prompts David if he would like the app to listen in on the conversation. David informs the doctor he is using an app that will listen to their conversation and help David better understand the conversation afterward. The doctor smiles and says of course, and reminds David that he would be happy to answer any questions he has any time before or after the surgery.

It is now Monday and time for a pre-procedure CT scan in preparation for the Isolator Synergy ablation clamp to treat David’s AFib. The CT scan shows severe blockage in his main arteries, and the cardiologist cautions David that he is at high risk of a heart attack, so much so that he needs immediate open heart surgery because of the blockage.

David begins conversing with his Medical Swiss Army Knife app, asking if his local doctors should have discovered this blockage. The app informs David that further tests may not have warranted it because he had no reported symptoms. It also advises him to ask his treating cardiologist and local doctor this when time permits.

Without using the Medical Swiss Army Knife app, David would have remained solely engaged with his local cardiologist, unaware of his high risk of a heart attack. Although perhaps just fortuitous, David would never have undergone the CT scan showing severe blockage but for seeking a catheter ablation.

David entered what was expected to be a three- to four-hour surgery but was instead six hours. The doctor completes the surgery and tells David’s wife, Ann, what occurred. He says the reason that David’s surgery took longer is that he had a physical abnormality, causing blood to go from his lungs to his heart in a way that the doctor had never seen or that anyone he knows had ever experienced.

The doctor emphasizes he has been doing this for decades, even working with babies with congenital heart disease and birth abnormalities, and has never seen anything like it. It took them time to try to get to the bottom of it, and instead of using one pump to recycle the blood, they had three of them working, which wasn’t enough.

We would be remiss not to mention why Ann had such confidence in the Medical Swiss Army Knife LLM medical app. She was diagnosed with CLL leukemia four years earlier. She had an appointment with an oncologist on a Monday and received a call from her daughter the previous Thursday. Her daughter was an active user of the Medical Swiss Army Knife app. The app suggested her mother would receive the best outcome at a cancer research hospital versus the local hospital she had planned for treatment. Her mother was not too keen on rescheduling her appointment as she liked her oncologist, and the local hospital was a short drive away compared to the research hospital. But she relented, canceled her appointment, and made an appointment to see an oncologist at the research hospital.

The research hospital had a slightly different treatment plan, which included a recently available FDA drug, IMBRUVICA®. Ann was quite pleased with the results and currently finds her cancer in remission. She credits her daughter and the app directing her to a care facility that produced better CLL leukemia outcomes. Ann understood that clinical outcomes could differ drastically based on the provider, and she was delighted that she got her husband, David, connected to an expert in treating AFib. She firmly believes it saved her husband’s life. It’s no secret that medical facilities that released research findings achieved elevated patient satisfaction scores and exhibited reduced patient mortality rates across a variety of medical conditions and procedures.¹⁸

By leveraging expansive data on providers’ clinical outcomes, the Medical Swiss Army Knife app, powered by an LLM, can match individual patients with the physicians statistically poised to provide the most effective treatment for the patient’s particular condition profile and risk factors.

Medical Sherpa App for Clinicians

Dr. Davis had been a primary care physician for over 20 years and had seen it all. But when his patient, John, came in for a routine physical checkup, Dr. Davis noticed something that made him pause. John had a small lump on the side of his throat. “John,” Dr. Davis said. “I’d like to have a closer look at that lump on your throat.” John nodded, so Dr. Davis palpated the lump with his relaxed fingers and furrowed his brow. The lump was firm and fixed, and it did not move under gentle finger pressure. “I am worried,” Dr. Davis said, “that this lump could be cancer.” He continued: “It would be my recommendation to follow up with a specialist immediately to be on the safe side.” John looked fretful. “But I do not feel ill,” he said. “I do not have any symptoms.”

“Cancer can often be asymptomatic early on,” Dr. Davis added. Reluctantly, John agreed to see a specialist. Dr. Davis consulted his Medical Sherpa, an LLM diagnostic app that could sift through mountainous factual knowledge.

Dr. Davis described the lump to his Medical Sherpa. The app bounced back with several suggestions, including asking for a fine needle aspiration (FNA) biopsy—a minimally invasive way to extract a cell sample from the lump—and directing John to an otolaryngologist, the right specialist for diagnosing and treating ear, nose, and throat conditions.

Following the guidance of the Medical Sherpa, Dr. Davis ordered an FNA biopsy for John. He also referred John to an otolaryngologist. Several days later, the results of the FNA biopsy came back positive for cancer. Dr. Davis called John, delivering an asymmetric shock of information: “I’m sorry to say that you have cancer, but we caught it early, and you can still receive therapy. Would it be OK,” Dr. Davis asked John, “if my Medical Sherpa helped you schedule an appointment with the otolaryngologist to discuss your treatment options?”

Medical Sherpa is a clinician-facing LLM app that is typically used by physicians seeking a consultation. A somewhat common practice in medicine is hallway, elevator, or curbside consultations. A Medical Sherpa app is in essence a consultation, albeit brief and informal, between a clinician and an LLM. The use of the name “sherpa” is apt because, similar to the guides who assist climbers ascending Mount Everest, the Medical Sherpa will assist clinicians in navigating through complex medical terrain. LLMs are envisioned to act as virtual assistants practicing beside physicians, offering insights and completing tasks. However, an essential component of the human in medicine is the guiding hand of clinical judgment.

There are both general reasons for medical sherpas to facilitate better and safer care as well as specific benefits to clinicians. For instance, when physicians work with their sherpas, they obtain proximate knowledge that is not afforded when relying upon data and analytics from afar. Similarly, the sherpa of nurses, being at the bedside, is better able to provide real-time support and advice via ubiquitous communication, enabling the nurse to make more informed decisions.

Furthermore, medical sherpas can assist providers in increasing productivity by saving them time. With fast and easy access to consultation and support, doctors and nurses can use the time saved to focus attention on other critical aspects of their job, which could help improve healthcare outcomes.

In addition, by reducing provider burnout, which has become recognized as a serious problem in healthcare, medical sherpas allow clinicians to spend more time caring for patients and less time training new learners for each case. Having someone with ongoing experience in this form of care can make a huge difference in the experience and confidence levels of clinicians here at home. Together, these benefits can lead to better quality of care for patients and a more sustainable system for the future.

LLMs’ Emerging Features

LLM-powered applications occupy an interesting space between a tantalizing vision for the future and a daunting series of obstacles to overcome. We’re very close to a future where LLM-based systems can tackle increasingly complex tasks, free humankind’s creative impulses in new directions, and fundamentally change how we interact with the world around us. But first, we must progress on technical frontiers involving data, performance, stability, and security.

There’s a human side to this other than technological infrastructure. There are matters related to privacy concerns around data-hungry LLMs. Bias, baked into the training data, creates the need for continuous monitoring and proactive mitigation strategies to prevent the reproduction of biases and harm in healthcare settings.

This means that while we haven’t reached our destination yet and while technology by itself won’t get us there, we are inching our way forward. Social, ethical, and conceptual thinking will be vital to scaling up responsible design approaches, making LLMs tools for improving physicians efficiency and effectiveness and patient-doctor interactions while preventing them from becoming tools of exclusion and harm.

The current form factors of LLM-based apps capabilities offer broad utility for healthcare, with the potential to provide assistive convenience into consumer lifestyles and healthcare operations. From our smartphones’ symptom checkers to clinical decision support in the back office, LLM use cases amplify the potential for better healthcare at numerous points along the spectrum of patient-doctor interactions.

Even though true game-changing innovation remains just beyond the horizon, we can see today that AI is already reshaping clinical spaces and consumer health tech to improve workflow efficiency and patient care. The book AI-First Healthcare¹⁹ documented numerous examples of how AI makes healthcare better. LLMs take AI another step forward, and automated note-taking, conversational chatbots, and summarization tasks are just the beginning.

More than any other emerging technology, LLMs promise an ongoing increase in social benefit—making existing systems aware of holes in nurses’ care, redirecting decision trees, and maximizing outcomes for every patient through both provider and purchaser empowerment. The here and now of this optimistic treatment of our shared future is occasioned by the arrival of consumer and business LLMs into our lives.

There are exciting changes ahead in the near future for LLMs, which include expanding prompt windows or what is called context windows. The window size continues to expand, and researchers are working on a prompt that allows for functionally infinite size.

Infinite Context Prompts

LLMs with extensive or unlimited context windows can now process text, audio, and video data simultaneously. This advancement opens new and enhanced possibilities for healthcare providers, health plans, and payers. This is interesting for clinicians because it could strengthen patient consultations by analyzing diverse data types in real time. Here are some of the ways this AI improvement might transform healthcare:

LLMs with access to medical literature, clinical notes, and guidelines can offer clinicians point-of-care, evidence-based recommendations for diagnosis, treatment, and care planning in real time. However, as with humans, there may be a pause (i.e., latency) in the reply, depending on the complexity of the prompt. By evaluating patient data alongside the medical literature and clinical best practices, LLMs might assist clinicians in reducing medical errors and enhance decision making to improve patient outcomes.
Models that can understand and generate natural-sounding text, audio, and video can enable more meaningful interactions between patients and clinicians that span across language barriers. LLMs could help transform complex medical information to versatile natural-sounding text that can be understood by a wider variety of patients, answer common queries, offer nuanced patient education that can be personalized to meet individual needs, and encourage early intervention. These interactions could then enhance patient engagement, adherence, and satisfaction with care.
Different LLMs could help automate paperwork and clinical documentation, including coding and billing, streamlining healthcare processes and liberating providers from the burden of administration so they can spend more clinically “face-to-face” with patients. Today, companies like Google offer technology that allows one to use LLMs to summarize an email with embedded video. Imagine what this would be like where the input stream is not bounded by a fixed size.
Models that can parse audio and video in real time would improve the efficiency and efficacy of telemedicine and remote monitoring services, helping with remote consultations.
With the capacity to analyze and synthesize massive amounts of biomedical literature and data, including scientific publications, clinical trial data, and patient records, LLMs can expedite medical research and drug discovery. Clinicians could save time with the power of the LLM to summarize clinical trial data or patient notes spanning years.
LLMs could enable personalized medicine and precision healthcare in delivering tailored care to individual patients based on their unique characteristics (e.g., based on genomic profiling, lifestyle, and medical history data) to identify personalized risk factors, disease trajectory, and therapeutic interventions and treatments. A more personalized approach to care, potentially enabled by LLMs, could increase the effectiveness and efficiency of healthcare delivery by optimizing patient outcomes.

The promise of personalized healthcare would be a big step forward. LLMs with infinite context windows or prompts could process and store large amounts of medical literature, clinical trial data, patient medical history, and clinical data, allowing for a comprehensive and updatable medical knowledge base for a patient or consumer. Chatbots powered by such LLMs would expand to more complex multiturn conversations, creating intuitive and engaging consumer experiences. The former Google CEO, Eric Schmidt, sees the expanding infinite prompt windows occurring within the next five years.²⁰

Agentic Reasoning

Agentic reasoning represents another evolution of AI where systems can act autonomously. Andrew Ng, a computer scientist and AI researcher, provides interesting perspectives on the nature of agentic reasoning, and describes four key features or patterns of agentic reasoning that we will explore in this chapter: the patterns of reflection, use of tools, planning, and multiagent interaction.

“Agentic reasoning lies at the heart of creating agents that can take actions aimed at achieving goals,” says Andrew Ng,²¹ adjunct professor of computer science at Stanford University and cofounder of Coursera, a company that offers massive open online courses. As Ng explains, it means an AI system’s capability to sense, desire, believe, and act, thereby setting and modifying goals, making decisions under uncertainty, learning from its experiences, and interacting and reasoning with humans and other AI agents in a natural and effective manner. The challenge of achieving agentic reasoning among AI agents, he points out, demands significant advances in multiple areas, such as machine learning, NLP, knowledge representation, and reasoning under uncertainty.

The four patterns of agentic reasoning

The reflection pattern in agentic reasoning helps AI improve its performance based on what it has done previously. The reflection pattern allows a healthcare AI system to reflect on its choices, identify ways to improve the outcomes, and continually develop its approach to patient care. For instance, an AI agent designed to provide clinicians with diagnosis and treatment recommendations for a complex disease could adopt the reflection pattern. The agent would have been trained initially on a large diverse dataset of patient records, literature, and clinical guidelines, and it would then make agentic recommendations to the clinician taking into account the prevalent data.

Initial diagnosis and treatment plan: When a new patient case is submitted, the agent will analyze the presenting patient’s symptoms, medical background, and test results, then provide an initial diagnosis and treatment plan. The agent will use its training data and apply its agentic reasoning to the situation, as well as modeled data about the modules that make it up, to determine what is likely the true cause of the patient’s condition and what the best treatment plan will be.
Reflection on outcomes: Once a patient is put on a treatment plan, the AI agent monitors the patient’s course and results as he or she goes along. What the patient achieves will be compared with what the agent would predict for the same patient given its initial recommendations. If the patient improves as the agent expected, it will reinforce itself and grow more confident in similar cases in the future.; But if there is no improvement in the patient’s condition or if the outcome is otherwise suboptimal after a certain period, the AI agent would examine why it made the decision it did—by looking at its algorithms, the data it used, and the assumptions it built into it.
Adaptation and learning: On the basis of this reflective analysis, the AGI lets the patient’s case anchor the adjustment it needs to make to its decision-making mode. For example, the AGI might add a record of the clinical findings to its background knowledge, refine an algorithm to incorporate an empirically known patient-specific nuance, or revise a list of treatment recommendations to reduce the odds of a known complication.; Crucially, this adaptive training process means that the agent is continuously learning to take more actions that improve its behavior in the long run and so ultimately make better recommendations—reducing chances of errors and prompting more appropriate remedies. When it has experienced more patients and engaged in this process of after-action, it can diagnose and treat more complicated medical problems.
Sharing insights and collaborative learning: This knowledge can be shared between AI agents and human experts by means of the reflective insights they acquire, thus enhancing colearning and coknowledge between humans and AI agents. For example, multiple AI agents can work collaboratively to recognize patterns and generate novel treatment strategies and refined patient care on a large scale.; The AI agent can provide feedback to human physicians, pointing out the places where they need to update their clinical practice or need additional research efforts. By engaging in this kind of a human-machine dialogue, we can ultimately enhance the hybrid nature of work between humans and machines.; The reflective structure of agentic reasoning allows AI agents working in healthcare to learn from their experiences, adjust their strategies, and continuously improve their ability to diagnose and treat patients. Through a continual process of reflection and collaborative learning with human experts, AI agents can become a complement to human care, enhancing quality, efficiency, and effectiveness of healthcare delivery. It is imperative that the reflective process is properly directed and informed by robust ethical principles, and that human oversight is always in place in order to prevent the unforeseen negligence and maintain the highest standards of care.

The tool use pattern in an agentic reasoning enables AI agents to leverage tools and external resources broadly, moving beyond machine learning, computer vision, or NLP to expand their problem-solving scope and decision-making process by harnessing external resources and knowledge capabilities. For medicine, the tool use pattern can enable AI systems to “borrow” medical resources via incorporating existing medical tools, databases, services, and all other external inputs, such as medical professionals like nurses, doctors, caretakers, and others. These inputs can provide principled and human-centric patient care based on up-to-date clinical know-how and professional decision-making rather than confining AI systems to “black box” decision making that relies exclusively on machine learning examples. Let’s look at precision medicine and illustrate how the tool use pattern can be applied.

A healthcare AI agent assists physicians in developing individualized treatment plans for their cancer patients. To do so, the agent uses agentic reasoning to analyze patient data and find optimal treatment options that the patient can follow, and the agent also monitors treatment progress. In order to improve its treatment recommendations further, the agent employs the tool use pattern to access and also combine with external resources and services.

Genomic analysis tools: An AI agent mines the genomic analysis toolbox to collect and make sense of the patient’s genetic information. Armed with databases of genomic variants and their known clinical implications, it can identify potential genetic risk factors, suggest likely drug responses, and prescribe a targeted therapy based on the patient’s individual molecular profile.
Medical imaging services: The medical imaging services—such as computer vision APIs—that the AI agent relies on analyze patients’ scans (MRI, CT, or PET scans) to detect and characterize the presence and shape of tumors, as well as treatment effects and disease progression. This information, combined with insights from other patient data, feeds into the overall assessment of the patient’s condition by the AI agent.
Electronic health record (EHR) systems: Utilizing EHR systems for accessing patient’s previous diagnoses, treatments, and outcomes will help the AI agent to construct a more accurate approach to the actual treatment. For example, rather than consulting the EHR of that particular patient, it can consult the EHRs of other patients to gain a more comprehensive view of the patient’s health status and potentially identify risk factors or comorbidities, which may affect the selection of treatment regimen. With access to data from an integrated system of EHRs and other related hospitals, the AI agent will be able to generate a more personalized care plan and related decisions.
Clinical trial databases: The AI agent searches clinical trial databases for trials relevant to the condition of the patient, and then examines eligibility criteria for trials, data on how to treat participants, and data on outcomes. This enables the AI agent to make a recommendation about trials a patient might benefit from joining or to use trial data for its recommendations on treatment.
Drug interaction checkers: The AI agent uses drug-interaction checkers to assess proposed cancer treatments for potential interactions with the patient’s current medications. It then recommends alternative medications or dosing changes based on the outcome as a way of minimizing adverse drug events or contraindications while maximizing efficacy.

Using these tools and services, the AI agent can then offer physicians an integrated approach toward precision medicine by compiling the relevant data from different sources and providing personalized treatment advice vetted by knowledge graphs and probabilistic scoring. This approach is feasible because the agent can journal-crawl, medical-text-mine, download, image-store, and integrate disparate data in a probabilistic manner. It can use patient medical history, genetic data, and imaging data to suggest appropriate therapies, including potential prescriptions, based on lesser-known drug interactions.

Furthermore, because the AI agent is making some of the decisions about new research data, clinical guidelines, and new or untested treatment routes, the agent’s tool use profile is largely self-updating—changing with the evolving patterns of discovery in human cancers. The agent will therefore be using the best and most up-to-date knowledge available.

And as the field of agentic reasoning develops in healthcare, this tool use pattern will come to play important roles in building AI systems that can capture, combine, and handle large amounts of diverse medical data needed in precision medicine to provide better patient care—as long as the services used externally to achieve these results respect robust data privacy, security, and ethical rules in order to maintain patients’ privacy and integrity of the healthcare system.

The planning pattern in agentic reasoning is essential for giving AI agents the ability to craft high-level plans to achieve their goals and optimize processes. This means that, in the healthcare domain, a planning-enabled AI system could be used to work through a detailed patient case, anticipate potential outcomes, and decide upon the best treatment plan before creating it—integrating a wide array of factors and parameters. Consider, for example, the scenario of an AI agent designed to assist physicians in managing patients with chronic diseases, such as diabetes, hypertension, or cardiovascular disease. In this case, the agent is using agentic reasoning to analyze the results of a physical examination, sequence the symptoms that arise, identify factors that place the patient at risk for worse health outcomes, then create strategic and adaptive recommendations for long-term health outcomes.

Goal setting and problem decomposition: The AI agent begins with an abstract objective of optimizing health outcomes and quality of life of the patient, and breaks it down into smaller, more specific subgoals: keep a patient’s blood sugar in an optimal range, decrease blood pressure to a safer level, minimize the risk of amputation or renal complications, and so forth. By breaking down the overall problem into distinct subproblems, the agent can formulate and pursue actions that are attuned to each particular aspect of the patient’s condition.
Data analysis and situation assessment: Then, the AI agent tries to mirror the entire medical situation of the patient according to its context. It takes into account the patient’s medical history, current health condition, and environmental context, as well as his lifestyle background and any identifiable idiosyncrasies. This includes the ability to integrate data from EHRs, wearables, and patient-reported outcomes.
Plan generation and evaluation: Drawing on this situation assessment, the AI agent generates different possible treatment plans that address the defined subgoals. For example, it might include one that involves using a different combination of medication adjustments, lifestyle changes, and referrals to specialists. The agent evaluates each plan by considering predicted effectiveness and side effects, patient preferences and acceptance, available resources, and so forth, using known data and probabilistic predictions before deciding which course of action to recommend.
Plan selection and adaptation: The AI agent would choose the treatment plan with the best value in its opinion, balancing benefits against the risks of treatment. Then it would communicate the selected plan, with supporting justification, to the physician and patient, perhaps also with instructions or support for implementing the recommendations.; Whereas the physician designs the treatment plan, the AI monitors as the plan unfolds and examines the results. If the patient’s condition is not following the predicted trajectory, the agent replans. The treatment becomes responsive to new information, such as changing medication dosages or introducing different interventions or lifestyle recommendations.
Continuous monitoring and refinement: The AI agent checks back in later with the patient to see how she has fared and whether the treatment plan is helpful or requires adjustment. It is also on the lookout for risks and adverse events from side effects. Where it can identify patterns in the patient’s own data and compare with the trajectory of similar cases, the agent can alter its planning strategies to better guard against emergent health problems.; This planning structure of agentic reasoning can help AI agents in the healthcare context to come up with executive, dynamic strategies of care management in chronic conditions. The AI agent breaks down the complex health problems into meaningful subgoals, runs pattern completion procedures using available patient data, brainstorms feeding and elimination options, provides an analysis of expected consequences, and monitors and self-adjusts its strategy. This way, the AI agent can assist the physician in providing personalized, evidence-based care by balancing short-term costs against long-term health benefits.; It’s unlikely that the planning pattern will be the only pattern needed as we progress in the exciting field of agentic reasoning in healthcare. But it will be essential in creating AI systems to help clinicians in the management of chronic diseases that make up a significant portion of the patient population, and it will keep them on the path toward equitable healthcare. We must be protective of the planning process—guiding it with ethical principles, clinical best practices and, patient-centered values to protect individuals from unsafe, ineffective, and unacceptable treatment plans that may emerge as the healthcare caseload continues to add mundane tasks for clinicians.

The multiagent collaboration pattern is the means by which the agentic architecture realizes the collaborative work of diverse agents, whether they are conceded agency or not, across different levels of the agentic ontology. Collective event recognition requires two or more agents’ awareness and evaluation of an event. In the healthcare domain, the multiagent collaboration pattern is in play when two or more intelligent agents—which can be thought of as AI systems and independent broad-scope healthcare professionals—coordinate their work, share state knowledge or perception, and make decisions and actions based on shared goals or subgoals.

Imagine a patient who has all kinds of long-term conditions—diabetes, hypertension, cardiovascular disease, and so on—and needs advice, monitoring, and treatment from a wide range of health professionals (e.g., a multidisciplinary team of physicians, nurses, dieticians, social workers, and psychologists). In such a situation, a range of different AI agents might be deployed to support the members of the health professional team, for example, to help them optimize medication choices, deliver lifestyle coaching, coordinate care, and so on. These AI agents employ agentic reasoning and the multiagent collaboration pattern to pool skills and working memory to deliver well-targeted, well-informed, coordinated care.

Shared goals and problem understanding: The AI agent and the human expert jointly define the patient’s status in terms of health, treatment goals, and potential barriers. Finally, they co-construct a personalized care plan, where the respective strengths of the human and algorithm help provide the best possible treatment for a patient’s medical, psychological, and social needs.
Task allocation and coordination: Specific to their assigned tasks, the AI agents allocate some of the work. The medication optimization agent may scan the patient’s prescriptions for drug interactions and suggest ways to optimize efficacy and safety. The lifestyle coaching agent can personalize diet, exercise, and stress-management recommendations to complement the patient’s self-care regimen.; The care-coordination agent stands at the center, gathering information from the myriad care agents and connecting each of them with the specific information they need. The care-coordination agent also ensures that the other agents and human experts are aware of the patient’s current status, change of status, and change of care plan.
Information sharing and knowledge exchange: The AI agents and human experts constantly exchange information and insights that support collective decision making and problem solving. They transmit patient data, treatment recommendations, and clinical insights via encrypted channels and standardized data formats so that each agent and expert can draw upon the collective knowledge of the whole group and update its strategies accordingly. For example, if the medication optimization agent detects a potential adverse drug event, it tells the care coordination agent, which in turn alerts human experts but also other AI agents. The team assesses what’s happening and generates an account of the event. They consider whether to remove an offending drug and replace it with a different option. If so, they update the care plan.
Conflict resolution and consensus building: If there are conflicting recommendations or opinions among the AI agents or humans, the multiagent collaboration pattern enables them to engage in argumentation and dialogues, negotiating on trade-offs and reaching a consensus assisted by argumentation, voting, or multicriteria decision analysis methods. This collaboration pattern makes sure the agreed decision is “in the best interests of the patient.”
Continuous learning and adaptation: If the patient’s condition changes and new data becomes available, the AI agents and human experts learn new strategies for the care coordination process, trading tips (as it were) that help make both their strategies more effective and efficient. The multiple agents interact to learn from each other’s successes and failures and develop new approaches in the face of new challenges over time.

This multiagent collaboration pattern from agentic reasoning allows AI agents and human experts in healthcare to work together in a coordinated way to provide holistic and personalized care for patients with complex health needs. Defining a shared goal, allocating tasks, sharing knowledge, resolving conflicts, and learning and adapting are among the components that can help the team to leverage collective intelligence to have a greater impact on optimizing patient outcomes and improving the quality and efficiency of care delivery.

Because agentic reasoning in healthcare is just beginning its evolution, the multiagent collaboration pattern will likely become even more important in the design of AI systems that can work shoulder to shoulder with their human counterparts—and even “learn from them” to complicate the increasingly diverse and interconnected healthcare landscapes. And ethical, profession-standard, and regulatory controls will be necessary to maintain the safety, privacy, and trust of patients and clinicians.

Challenges and future directions

These four different patterns of agentic reasoning provide an opportunity to scale AI to human levels of intelligence in many areas. There are, of course, huge challenges ahead, in determining how to ensure that agentic AI agents interact with humans in a way that is safe, ethical, and aligned. This will involve, for example, developing robust frameworks for value alignment, as well as mechanisms for holding such agents to account and ensuring fairness in their operations.

A second challenge is embedding the four agent-centered patterns of reasoning into unified, flexible, and scalable AI architectures, which could require advances in transfer learning, multitask learning, and open-ended learning to allow AI agents to learn knowledge in one task or situation to help solve another.

Agentic reasoning technologies will likely make significant progress in the long run. What continues to be very interesting about this area of research is that there hasn’t been as much work done for researchers to pursue. But it is certainly possible to envision that, over time, we might see AI systems that reflect and perhaps tool, plan, and learn with increasingly complex forms of reasoning and collaboration. Such advances could transform many fields, from healthcare, education, and transportation to manufacturing and beyond.

Context of Use When Using LLMs

Understanding foreseeable use cases of LLM apps recognizes the central importance of “context of use,”²² a term coined by Margaret Mitchell, when creating healthcare LLM apps. Perhaps Mitchell’s thinking lies in a long-standing software engineering practice of human-centered engineering. Because healthcare LLM apps are so open-ended in possible user prompts, they offer interesting use cases for improving the healthcare systems worldwide, yet they also offer challenges in preemptively predicting user interactions.

Unlike physical objects, which might have a finite number of intended use cases, most software apps are so open-ended in their interactions that we cannot fully predict how end users will ultimately use them. A chair can be used for a finite number of uses (such as sitting ), but an app is open-ended. A machine learning model may be developed to predict chronic disease. A disease model may be developed to predict a specific disease, such as heart disease or obesity. Another user or organization may choose to use a specific machine learning model to determine the cost of granting health insurance coverage, while another user may choose to apply the same machine learning model to deny health insurance coverage.

The flexibility of software means users can bend it to their task by using an application in ways that work best with their particular needs, workflows, and users. This productivity app could be designed for task management, but it might be used as a project collaboration tool. The openness of software—made possible through this intrinsic flexibility—means the organizations or companies making that software must also be ready for the ultimate challenge of human exploitation. With respect to the LLM-powered chatbot, the flexibility of natural language interaction means that its prompt openness is difficult to anticipate or restrict in terms of context and potential outcomes. Users might ask questions (or make requests) that are not intended in the scope set for the chatbot. It is possible that users will try to manipulate its responses into causes for harmful or inappropriate outcomes.

For instance, someone seeking a diagnosis might ask a wellness chatbot that was built for general health-related discussion, which could generate off-base or unsafe assumptions. Even the most lighthearted of chatbots is at risk of colliding with hostile or abusive interaction from a customer service bot that becomes unfairly criticized for its errors.

Mitigating these risks will require developers of LLM-powered chatbots to build layers of safety, codes of ethics, and content moderation tools. Examples might include using adversarial forms of testing—where a system is deliberately exposed to the widest range of possible inputs from users to identify holes in its training and specification of representativeness—to ensure that, for example, asking the bot not to be rude doesn’t cause it to spout racist comments. Whatever the strategy adopted, developers must make sure boundaries and expectations of impossibilities are clearly set out and communicated to users to reduce the risk of a user trying to force the bot to do the impossible.

Second, as noted previously, chatbots powered by LLMs should be constantly monitored and refined to ensure that they continue to perform as desired. Developers should ground this process by actively seeking the feedback of users regarding their daily experiences with their chatbots. Developers should also examine patterns in interactions. They should subject input and feedback to further analysis, and update knowledge bases and response systems accordingly in order to optimize how the chatbots perform under the new conditions created by their users.

To sum up, the open-ended nature of software applications, including LLM-driven chatbots, can present opportunities and safeguards for anticipating, planning, and addressing users’ interactions. The open-ended nature of an LLM-driven chatbot enables creators to foresee novel uses within its framework that can be beneficial to users. The open-endedness of an LLM-driven chatbot can also lead to unforeseen and harmful uses. Nevertheless, by implementing safeguards, morality guides, and continuous monitoring and improvements, creators can enhance users’ experience when utilizing LLM-powered chatbots.

Whether it’s vetting politically biased search results or catching linguistic markers of dementia, the value of applying context to LLM apps is clear. By designing LLM apps with context in mind, we can build healthcare tools that are more robust, ethical, and beneficial to patients. LLM apps should do the following:

Encourage responsible use by providing clear interfaces, educational materials for clinicians, and transparency about the limits of the AI.
Enforce safeguards against identified misuse scenarios (e.g., security controls for data, preventive measures to disable stacking against biased outputs).
Let AI improve upon itself as it is adaptively deployed in new contexts. Adjustments could occur by monitoring the AI in use in the world (to the extent possible) and tuning the model toward any problems that arise.

Consumer and Business LLMs

Today, we have apps and applications basically divided into two groups, consumer and business. They serve different purposes and target different users. Business apps or applications typically are designed for a company’s employees. However, we also have businesses creating apps for their customers to access health plans, understand benefits, make appointments, and more. Perhaps the biggest examples of consumer apps are in areas of social media, entertainment, productivity, gaming, and commerce, to name a few.

In healthcare, we see medical apps designed to make personal health easier. Apps to schedule doctor house calls (e.g., ZocDoc), therapy apps (e.g., Talkspace), telehealth apps (e.g., Doctor on Demand), women’s healthcare apps like Maven, and more. We expect to see over time more healthcare LLM-based apps emerge to cover many of the use cases described in Chapters 3, 4, and 5.

Consumer LLMs and Generative AI

This book explores a key hypothesis: the rise of consumer-focused applications powered by LLMs will significantly transform healthcare. These apps, leveraging LLMs’ abilities to summarize information and generate content, are expected to:

Enhance the doctor-patient relationship
Help individuals better manage their chronic diseases and overall health
Most importantly, intervene to delay or prevent the onset of chronic diseases

By harnessing LLMs’ capabilities, these consumer applications have the potential to revolutionize personal health management and preventive care.

Consumer LLMs are designed for individual users, offering various applications and functionalities tailored to personal needs and interests. These LLMs include models like chatbots, virtual assistants, and content generators. Here are some key characteristics of consumer LLMs:

Conversational assistants: Consumer LLMs such as virtual assistants (e.g., Siri, Google Assistant) are developed to assist users in setting reminders, answering general knowledge questions, sending messages, and playing music. They are designed for everyday convenience.
Engagement and entertainment: Consumer LLMs are often designed to provide interactive experiences—such as conversational AI assistants, chatbots, or creative writing tools—that aim to engage and entertain users.
Content generation: Some consumer LLMs can generate text, which can be helpful for tasks like drafting emails, writing creative content, or even coding assistance. These models focus on enhancing personal productivity and creativity.
Personalization: Consumer LLMs often prioritize personalization by learning from user interactions to provide tailored recommendations, content, and responses.
Personal assistant: These LLMs may assist with answering healthcare questions, providing recommendations, writing emails or documents, scheduling appointments with clinicians, and helping with various individual productivity tasks.
Accessibility: These models are often deployed with user-friendly interfaces, accessible to a broad range of users, and are often available on mobile devices and personal computers.

Business LLMs and Generative AI

Businesses and organizations design their business LLMs and generative AI, both for employee and customer uses, in order to automate tasks, interpret data, and generate things such as text, images, and video. Business LLMs are designed for use in organizations and enterprises, with the following characteristics:

Data integration: Business LLMs are designed to integrate out of the box with your organization’s data sources (e.g., EHRs or other clinical, claims, pharmacy, or eligibility databases). Using all of this health sector data, it can provide you with insights and reports. LLMs allow for analyzing large amounts of business data. For example, an LLM could quickly evaluate the complex and ever-changing prior authorization criteria used by payers and insurance companies.
Business LLMs specific to certain industries: Developed to benefit a certain industry such as healthcare, industry-specific LLMs can help with tasks ranging from diagnosing an illness to processing claims or making clinical decisions.
Collaboration: These LLMs often come with features for shared teamwork spaces, document collaboration/sharing, and workflow automation to increase organizational productivity.
Knowledge management: Business LLMs can help organizations collect and share knowledge by building knowledge bases, summarizing data, and offering contextual suggestions.
Customer service and support: From commodities trading to buying concert tickets, LLMs can power conversational AI assistants and chatbots to provide customer support and answer queries.
Service guarantees: Enterprise-level LLMs include service-level agreements and dedicated customer service, making them reliable for business operations.

To summarize, the key difference between consumer and business LLM tools is this: consumer LLMs are mainly geared toward personalized convenience and personal productivity, while business LLMs are built for industry-specific use cases—with custom data integration and enterprise-grade support for operations.

Bridging the Divide

This distinction between the consumer and business LLMs/generative AI is actually an important one because it influences use and its audience. It’s important to distinguish between business and consumer LLMs for several reasons:

Purpose and goals
Data training
Regulatory landscape
Ethics and bias

Purposes and goals

Business LLMs are geared to solving a specific business problem or improving business processes. These range from automating interactions between a health plan member and a customer service employee to creating insights from data that businesses already have.

Consumer LLMs are designed for individual use and educational purposes. They offer services such as language translation and conversational question-answering. Importantly, they can be tailored to individual preferences and needs, providing personalized responses based on a user’s past interactions, stated interests, or specific requirements.

Data training

Business LLMs are trained on domain-specific data sets. In this way, we can “tune” the LLM to our business domain so that it’s not only addressing content directly but also doing so with knowledge of the business context and jargon.

Consumer LLMs are trained on big, general-purpose corpora (collections of text and code) pulled from diverse public websites. These provide generalist exposure but risk bias and lack specialized knowledge and/or domain expertise. Using the AI framework RAG, the data extends to external data sources just as with a business LLM.

Regulatory landscape

Business LLMs are regulated pursuant to industry-specific rules (for example, the financial sector) or rules regulating data (e.g., medical data protection).

Consumer LLMs are bound by consumer protection statutes and regulations regarding data privacy and ethical AI practices. For example, HIPAA does impose restrictions on how an individual or their designee uses health information consistent with the individual’s right of access.

Ethics and bias

Business LLMs: Careful stewardship and mitigation of bias is necessary to avoid discriminatory or otherwise unfair treatment of potential customers, employees, etc.

Consumer LLMs: Bias in consumer LLMs can lead to harmful misinformation, offensive content, or the perpetuation of social inequalities. It’s essential that development of these technologies is responsible and that unintended bias is continuously addressed.

In conclusion, while big language models are used by both businesses and consumers, their different, albeit potentially linked, purposes and associated needs for input, control (including security), and ethical considerations should all compel us to think about their development and use differently, according to those different purposes and backgrounds.

Summary

LLMs can open up a world of potential that was once restricted to the domain of science fiction. By delving into the potential of these advanced language models, this chapter explored a range of futuristic promises and applications (two of which—the Medical Swiss Army Knife for consumers and the Medical Sherpa for clinicians—are powered by an LLM). Language models (LLMs)—machines that can read, write, and manipulate human language with remarkable fluency and flexibility—ushered in this new era. LLMs are still evolving rapidly, and their capabilities continue to improve. LLMs promise to transform patient care, research, and medical knowledge across a wide range of health sectors.

But the biggest difference may be what kind of users and use cases are anticipated for an LLM-powered app. Consumer LLM apps (like the Medical Swiss Army Knife) focus on end-user convenience in making educated medical decisions—from small-scale, self-event management and self-diagnosis to broad-based health-promotion, self-care, and family healthcare apps. Business LLM apps (like the Medical Sherpa) will cater to healthcare professionals and organizations populating or searching the ever-growing medical literature, clinicians making diagnosis decisions, and pharmaceutical researchers developing drugs. For consumer LLM apps, convenience and ease of use are key to appeal. For business LLM apps, issues such as data privacy, HIPAA and regulatory compliance, and industry-specific features are the elephants in the room.

But as society moves deeper into LLMs, their solutions and promises will shape a world that will be filled with new tools for health consumers and medical professionals alike, and create a near future diversified by greater access to healthcare and medical knowledge.

¹ C. M. Kornbluth, “The Little Black Bag,” in The Best of C. M. Kornbluth, ed. Frederik Pohl (Garden City, NY: Nelson Doubleday, 1976), 42–69.

² Matt Marshall, “NY Hospital Exec: Multimodal LLM Assistants Will Create a ‘Paradigm Shift’ in Patient Care,” VentureBeat, March 6, 2024, https://venturebeat.com/ai/ny-hospital-exec-multimodal-llm-assistants-will-create-a-paradigm-shift-in-patient-care.

³ LLMs may seem to understand human language, but they are sophisticated statistical models. These models recognize patterns, translate between languages, predict likely words, and generate coherent text. However, they don’t truly comprehend meaning in the way humans do. See “Risks of Large Language Models (LLM),” IBM Technology, April 14, 2023, YouTube video, 8:25, https://www.youtube.com/watch?v=r4kButlDLUc&t=278s.

⁴ “ChatGPT Experiments: Autoregressive Large Language Models (AR-LLMs) and the Limits of Reasoning as Structured Summarization,” The GDELT Project, February 14, 2023, https://blog.gdeltproject.org/chatgpt-experiments-autoregressive-large-language-models-ar-llms-and-the-limits-of-reasoning-as-structured-summarization.

⁵ Hoifung Poon and Jianfeng Gao, “Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing,” Microsoft Research Blog, August 31, 2020, https://www.microsoft.com/en-us/research/blog/domain-specific-language-model-pretraining-for-biomedical-natural-language-processing.

⁶ PubMed, accessed June 20, 2024, https://pubmed.ncbi.nlm.nih.gov.

⁷ Jinhyuk Lee, et al., “BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining,” Bioinformatics 36, no. 4 (February 2020): 1234–1240, https://academic.oup.com/bioinformatics/article/36/4/1234/5566506.

⁸ Iz Beltagy, Kyle Lo, and Arman Cohan, “SciBERT: A Pretrained Language Model for Scientific Text,” arXiv, September 10, 2019, https://arxiv.org/abs/1903.10676.

⁹ Kexin Huang, Jaan Altosaar, and Rajesh Ranganath, “ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission,” CHIL ’20 Workshop, April 2–4, 2020, Toronto, ON, https://arxiv.org/pdf/1904.05342#:~:text=ClinicalBERT%20is%20an%20application%20of,task%20of%20hospital%20readmission%20prediction.

¹⁰ Alistair E. W. Johnson, et al., “MIMIC-III, a Freely Accessible Critical Care Database,” Scientific Data 3, no. 160035 (2016), https://www.nature.com/articles/sdata201635.

¹¹ “Med-PaLM,” Google Research, accessed June 20, 2024, https://sites.research.google/med-palm.

¹² Karan Singhal, et al., “Towards Expert-Level Medical Question Answering with Large Language Models,” arXiv, May 16, 2023, https://arxiv.org/pdf/2305.09617.

¹³ Amanda Jane Modaragamage, “Top Wearable Medical Devices Used in Healthcare,” Healthnews, January 16, 2024, https://healthnews.com/family-health/healthy-living/wearable-medical-devices-used-in-healthcare.

¹⁴ David M. Levine and Ateev Mehrotra, “Assessment of Diagnosis and Triage in Validated Case Vignettes Among Nonphysicians Before and After Internet Search,” JAMA Network Open 4, no. 3 (2021): e213287, https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2777835.

¹⁵ “What’s Next in Store for Conversational AI in Enterprise Apps?” Koru, June 11, 2024, https://www.koruux.com/blog/conversational-ai-in-enterprise-apps.

¹⁶ Margaret E, Kruk, et al., “Mortality Due to Low-Quality Health Systems in the Universal Health Coverage Era: A Systematic Analysis of Amenable Deaths in 137 Countries,” Lancet 392, no. 10160 (November 17, 2018): 2203–2212, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6238021.

¹⁷ “UpToDate: Trusted, Evidence-Based Solutions for Modern Healthcare,” Wolters Kluwer, accessed June 20, 2024, https://www.wolterskluwer.com/en/solutions/uptodate.

¹⁸ Michael Morrison, “Do Hospitals That Conduct Research Provide Better Care for Patients?” Massachusetts General Hospital, press release, February 28, 2022, https://www.massgeneral.org/news/press-release/do-research-hospitals-provide-better-care-for-patients.

¹⁹ Kerrie Holley and Siupo Becker, AI-First Healthcare: AI Applications in the Business and Clinical Management of Health (O’Reilly Media, 2021), https://learning.oreilly.com/library/view/ai-first-healthcare/9781492063148.

²⁰ “The Future of AI, According to Former Google CEO Eric Schmidt,” Noema Magazine, May 21, 2024, YouTube video, 20:06, https://www.youtube.com/watch?v=DgpYiysQjeI.

²¹ “What’s Next for AI Agentic Workflows ft. Andrew Ng of AI Fund,” Sequoia Capital, March 26, 2024, YouTube video, 13:39, https://www.youtube.com/watch?v=sal78ACtGTc&t=524s.

²² Margaret Mitchell, “Ethical AI Isn’t to Blame for Google’s Gemini Debacle,” Time, February 29, 2024, https://time.com/6836153/ethical-ai-google-gemini-debacle.

Get LLMs and Generative AI for Healthcare now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial

LLMs and Generative AI for Healthcare by Kerrie Holley, Manish Mathur