
Structured Extraction from Patient Intake Form with LLM



In this blog, we will show you how to use the OpenAI API to extract structured data from patient intake forms in different formats (PDF, Docx, etc.) stored in Google Drive.

You can find the full code here 🤗!

If you like our work, it would mean a lot to us if you could support CocoIndex on GitHub with a star. Thank you so much with a warm coconut hug 🥥🤗.

Video Tutorial

Prerequisites

Install Postgres

If you don't have Postgres installed, please refer to the installation guide.

Enable Google Drive access by service account

In this tutorial, we are going to use Google Drive as the source for loading the patient intake forms. Please refer to the Google Drive guide to enable Google Drive access by a service account.

You can also refer to this blog for step-by-step instructions and screenshots.

Prepare testing files in Google Drive

For this tutorial, we've added a few artificial patient intake forms to our Google Drive. They are also available in the GitHub repo, so you can upload them to your own Google Drive and use them for testing. Credit for the PDF form templates goes to getfreed.ai.

(Screenshot: the test files in Google Drive.)

Extract Structured Data from Google Drive

1. Define output schema

We are going to define the patient info schema for structured extraction. A good reference for modeling patient information is the FHIR standard's Patient Resource.

(Screenshot: the FHIR Patient Resource structure.)

In this tutorial, we'll define a simplified schema for patient information extraction:

import dataclasses
import datetime

@dataclasses.dataclass
class Contact:
    name: str
    phone: str
    relationship: str

@dataclasses.dataclass
class Address:
    street: str
    city: str
    state: str
    zip_code: str

@dataclasses.dataclass
class Pharmacy:
    name: str
    phone: str
    address: Address

@dataclasses.dataclass
class Insurance:
    provider: str
    policy_number: str
    group_number: str | None
    policyholder_name: str
    relationship_to_patient: str

@dataclasses.dataclass
class Condition:
    name: str
    diagnosed: bool

@dataclasses.dataclass
class Medication:
    name: str
    dosage: str

@dataclasses.dataclass
class Allergy:
    name: str

@dataclasses.dataclass
class Surgery:
    name: str
    date: str

@dataclasses.dataclass
class Patient:
    name: str
    dob: datetime.date
    gender: str
    address: Address
    phone: str
    email: str
    preferred_contact_method: str
    emergency_contact: Contact
    insurance: Insurance | None
    reason_for_visit: str
    symptoms_duration: str
    past_conditions: list[Condition]
    current_medications: list[Medication]
    allergies: list[Allergy]
    surgeries: list[Surgery]
    occupation: str | None
    pharmacy: Pharmacy | None
    consent_given: bool
    consent_date: datetime.date | None
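
To make the extraction target concrete, here is a hypothetical example of the kind of record we expect the LLM to produce for one intake form. All values below are fictional and purely illustrative; they do not come from the sample forms.

    # Fictional, illustrative record only -- not taken from the sample forms.
    example_patient = Patient(
        name="Jane Doe",
        dob=datetime.date(1985, 4, 12),
        gender="female",
        address=Address(street="123 Main St", city="Springfield", state="IL", zip_code="62704"),
        phone="555-0100",
        email="jane.doe@example.com",
        preferred_contact_method="email",
        emergency_contact=Contact(name="John Doe", phone="555-0101", relationship="spouse"),
        insurance=Insurance(
            provider="Acme Health",
            policy_number="P-12345",
            group_number=None,
            policyholder_name="Jane Doe",
            relationship_to_patient="self",
        ),
        reason_for_visit="Annual physical",
        symptoms_duration="2 weeks",
        past_conditions=[Condition(name="Asthma", diagnosed=True)],
        current_medications=[Medication(name="Albuterol", dosage="90 mcg as needed")],
        allergies=[Allergy(name="Penicillin")],
        surgeries=[Surgery(name="Appendectomy", date="2010-06-01")],
        occupation="Teacher",
        pharmacy=Pharmacy(
            name="Main St Pharmacy",
            phone="555-0102",
            address=Address(street="456 Oak Ave", city="Springfield", state="IL", zip_code="62704"),
        ),
        consent_given=True,
        consent_date=datetime.date(2024, 1, 15),
    )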

2. Define CocoIndex Flow

Let's define the CocoIndex flow to extract the structured data from patient intake forms.

  1. Add Google Drive as a source

    @cocoindex.flow_def(name="PatientIntakeExtraction")
    def patient_intake_extraction_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
        """
        Define a flow that extracts patient information from intake forms.
        """
        credential_path = os.environ["GOOGLE_SERVICE_ACCOUNT_CREDENTIAL"]
        root_folder_ids = os.environ["GOOGLE_DRIVE_ROOT_FOLDER_IDS"].split(",")

        data_scope["documents"] = flow_builder.add_source(
            cocoindex.sources.GoogleDrive(
                service_account_credential_path=credential_path,
                root_folder_ids=root_folder_ids,
                binary=True))

        patients_index = data_scope.add_collector()

    flow_builder.add_source will create a table with a few sub-fields (e.g. filename, content). See the documentation here.

  2. Parse documents with different formats to Markdown

    Define a custom function to parse documents of any format to Markdown. Here we use MarkItDown to convert the file to Markdown. It also provides an option to parse with an LLM, such as gpt-4o. At present, MarkItDown supports PDF, Word, Excel, images (EXIF metadata and OCR), and more. You can find its documentation here.

    class ToMarkdown(cocoindex.op.FunctionSpec):
        """Convert a document to markdown."""

    @cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1)
    class ToMarkdownExecutor:
        """Executor for ToMarkdown."""

        spec: ToMarkdown
        _converter: MarkItDown

        def prepare(self):
            client = OpenAI()
            self._converter = MarkItDown(llm_client=client, llm_model="gpt-4o")

        def __call__(self, content: bytes, filename: str) -> str:
            suffix = os.path.splitext(filename)[1]
            with tempfile.NamedTemporaryFile(delete=True, suffix=suffix) as temp_file:
                temp_file.write(content)
                temp_file.flush()
                text = self._converter.convert(temp_file.name).text_content
                return text

    Next we plug it into the data flow.

    with data_scope["documents"].row() as doc:
        doc["markdown"] = doc["content"].transform(ToMarkdown(), filename=doc["filename"])
  3. Extract structured data from Markdown

    CocoIndex provides built-in functions (e.g. ExtractByLlm) that process data using LLMs. In this example, we use gpt-4o from OpenAI to extract structured data from the Markdown. We also provide built-in support for Ollama, which lets you easily run LLM models on your local machine (see the sketch after this list).

    with data_scope["documents"].row() as doc:
        doc["patient_info"] = doc["markdown"].transform(
            cocoindex.functions.ExtractByLlm(
                llm_spec=cocoindex.LlmSpec(
                    api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"),
                output_type=Patient,
                instruction="Please extract patient information from the intake form."))
        patients_index.collect(
            filename=doc["filename"],
            patient_info=doc["patient_info"],
        )

    After the extraction, we just need to cherry-pick the fields we want from the output by calling the collect method on the collector defined above.

  4. Export the extracted data to a table.

    patients_index.export(
        "patients",
        cocoindex.storages.Postgres(table_name="patients_info"),
        primary_key_fields=["filename"],
    )
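
As noted in step 3, CocoIndex also has built-in support for Ollama, so the same extraction can run against a local model. Below is a minimal sketch of that variant; the enum value cocoindex.LlmApiType.OLLAMA and the model name "llama3.2" are assumptions on my part, so check the CocoIndex LLM documentation and your local Ollama installation for the exact values.

    # Minimal sketch: the same extraction step, pointed at a local model via Ollama.
    # LlmApiType.OLLAMA and the "llama3.2" model name are assumptions; adjust them
    # to match your CocoIndex version and local Ollama setup.
    with data_scope["documents"].row() as doc:
        doc["patient_info"] = doc["markdown"].transform(
            cocoindex.functions.ExtractByLlm(
                llm_spec=cocoindex.LlmSpec(
                    api_type=cocoindex.LlmApiType.OLLAMA, model="llama3.2"),
                output_type=Patient,
                instruction="Please extract patient information from the intake form."))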

Evaluate

🎉 Now you are all set with the extraction! For mission-critical use cases, it is important to evaluate the quality of the extraction. There may be fancier ways to do this, but for now we'll use the simple approach that CocoIndex supports out of the box.

  1. Dump the extracted data to YAML files.

    python3 main.py cocoindex evaluate

    It dumps what should be indexed to files under a directory. With my example data sources, the output looks just like the golden files, except that the directory name carries a timestamp.

  2. Compare the extracted data with golden files. We created a directory with golden files for each patient intake form. You can find them here.

    You can run the following command to see the diff:

    diff -r data/eval_PatientIntakeExtraction_golden data/eval_PatientIntakeExtraction_output

    I used a tool called DirEqual on macOS. We also recommend Meld for Linux and Windows.

    A diff from DirEqual looks like this:

    (Screenshot: DirEqual folder-level comparison.)

    Double-click on any row to see a file-level diff. In my case, a condition is missing for the Patient_Intake_Form_Joe.pdf file.

    (Screenshot: DirEqual file-level diff.)

Troubleshooting

My original golden file for this record is this one.

(Screenshot: the golden file showing the missing condition.)

We will troubleshoot in two steps:

  1. Convert to Markdown
  2. Extract structured data from Markdown

In this tutorial, we'll show how to use CocoInsight to troubleshoot this issue.

python3 main.py cocoindex server -c https://cocoindex.io

Go to https://cocoindex.io/cocoinsight. You'll see an interactive UI to explore the data.

(Screenshot: the CocoInsight UI.)

Click on the markdown column for Patient_Intake_Form_Joe.pdf to see the Markdown content.

(Screenshot: the Markdown content for Patient_Intake_Form_Joe.pdf.)

This Markdown is not well understood by the LLM extraction. Here we could iterate with a few different models for the Markdown converter and/or the extraction LLM to see whether we get better results, or fall back to manual correction.
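
One quick way to iterate is to run MarkItDown by itself on the problematic file and experiment with different models before re-running the whole flow. The snippet below is a minimal sketch, assuming the markitdown and openai packages are installed and the PDF has been downloaded locally; the file path and the gpt-4o-mini model name are illustrative placeholders.

    from markitdown import MarkItDown
    from openai import OpenAI

    # Run the converter standalone with a different LLM model to compare outputs.
    # The local file path and model name are placeholders, not from the original post.
    converter = MarkItDown(llm_client=OpenAI(), llm_model="gpt-4o-mini")
    result = converter.convert("Patient_Intake_Form_Joe.pdf")
    print(result.text_content)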

Query the extracted data

Run the following commands to set up and update the index.

python main.py cocoindex setup
python main.py cocoindex update

You'll see the index update status in the terminal.

After the index is built, you have a table with the name patients_info. You can query it at any time, e.g., start a Postgres shell:

psql postgres://cocoindex:cocoindex@localhost/cocoindex

Then run:

select * from patients_info;

You'll see the contents of the patients_info table.

(Screenshot: the patients_info table in psql.)
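
If you prefer querying from Python instead of psql, here is a minimal sketch using psycopg2 (any Postgres client works; psycopg2 is just an assumption here). It reuses the connection string above and assumes the table columns mirror the collected fields, filename and patient_info.

    import psycopg2

    # Connect with the same connection string used for psql above.
    conn = psycopg2.connect("postgres://cocoindex:cocoindex@localhost/cocoindex")
    with conn, conn.cursor() as cur:
        # Column names are assumed to mirror the collector fields.
        cur.execute("SELECT filename, patient_info FROM patients_info;")
        for filename, patient_info in cur.fetchall():
            print(filename, patient_info)
    conn.close()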

You can also use CocoInsight, mentioned above, as a debugging tool to explore the data.

(Screenshot: the extracted data in CocoInsight.)

🎉 Now you are ready to build your own pipeline that extracts structured data from patient intake forms in any format from Google Drive! 🚀🥥 If you like this post and our work, please support CocoIndex on GitHub with a star ⭐. Thank you with a warm coconut hug 🥥🤗.

Community

We'd love to hear from the community! You can find us on GitHub and Discord.