# Structured Extraction from Patient Intake Form with LLM

> Extract structured data from patient intake forms in PDF and Word documents using an LLM and CocoIndex: a practical healthcare document extraction example.

Published: 2025-03-26 · Canonical: https://cocoindex.io/blogs/patient-intake-form-extraction-with-llm/

In this post, we show how to use the OpenAI API with CocoIndex to extract structured data from patient intake forms in different formats, such as PDF and DOCX.

You can find the full code [here](https://github.com/cocoindex-io/cocoindex/tree/main/examples/patient_intake_extraction) 🤗.

If this tutorial is helpful, please give [CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex) a star ⭐. [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)

## Prerequisites
### Install Postgres
If you don't have Postgres installed, please refer to the [installation guide](https://cocoindex.io/docs/getting_started/installation).

### Google Drive as alternative source (optional)
If you plan to load patient intake forms from Google Drive, you can refer to this [example](https://cocoindex.io/blogs/text-embedding-from-google-drive#enable-google-drive-access-by-service-account) for more details.

## Extract structured data from Google Drive
### 1. Define output schema

First, we define the patient info schema for structured extraction. A good reference for designing such a schema is the [FHIR standard - Patient Resource](https://build.fhir.org/patient.html#resource).

In this tutorial, we'll define a simplified schema for patient information extraction:

```python
import dataclasses
import datetime

@dataclasses.dataclass
class Contact:
    name: str
    phone: str
    relationship: str

@dataclasses.dataclass
class Address:
    street: str
    city: str
    state: str
    zip_code: str

@dataclasses.dataclass
class Pharmacy:
    name: str
    phone: str
    address: Address

@dataclasses.dataclass
class Insurance:
    provider: str
    policy_number: str
    group_number: str | None
    policyholder_name: str
    relationship_to_patient: str

@dataclasses.dataclass
class Condition:
    name: str
    diagnosed: bool

@dataclasses.dataclass
class Medication:
    name: str
    dosage: str

@dataclasses.dataclass
class Allergy:
    name: str

@dataclasses.dataclass
class Surgery:
    name: str
    date: str

@dataclasses.dataclass
class Patient:
    name: str
    dob: datetime.date
    gender: str
    address: Address
    phone: str
    email: str
    preferred_contact_method: str
    emergency_contact: Contact
    insurance: Insurance | None
    reason_for_visit: str
    symptoms_duration: str
    past_conditions: list[Condition]
    current_medications: list[Medication]
    allergies: list[Allergy]
    surgeries: list[Surgery]
    occupation: str | None
    pharmacy: Pharmacy | None
    consent_given: bool
    consent_date: datetime.date | None
```

### 2. Define CocoIndex flow
Let's define the CocoIndex flow to extract the structured data from patient intake forms.

1.  Add Google Drive as a source
    ```python
    import os

    import cocoindex

    @cocoindex.flow_def(name="PatientIntakeExtraction")
    def patient_intake_extraction_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
        """
        Define a flow that extracts patient information from intake forms.
        """
        credential_path = os.environ["GOOGLE_SERVICE_ACCOUNT_CREDENTIAL"]
        root_folder_ids = os.environ["GOOGLE_DRIVE_ROOT_FOLDER_IDS"].split(",")
        
        data_scope["documents"] = flow_builder.add_source(
            cocoindex.sources.GoogleDrive(
                service_account_credential_path=credential_path,
                root_folder_ids=root_folder_ids,
                binary=True))

        patients_index = data_scope.add_collector()
    ```

    `flow_builder.add_source` creates a table with a few sub fields, such as the `filename` and `content` fields used below. See the [documentation](https://cocoindex.io/docs/sources) for details.

2.  Parse documents with different formats to Markdown

    Define a custom function that converts documents in any supported format to Markdown. Here we use [MarkItDown](https://github.com/microsoft/markitdown) for the conversion. It can also use an LLM, such as `gpt-4o`, during parsing.
    At present, MarkItDown supports PDF, Word, Excel, images (EXIF metadata and OCR), and more. See its [documentation](https://github.com/microsoft/markitdown) for details.

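    ```python
    import os
    import tempfile

    import cocoindex
    from markitdown import MarkItDown
    from openai import OpenAI

    class ToMarkdown(cocoindex.op.FunctionSpec):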
        """Convert a document to markdown."""

    @cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1)
    class ToMarkdownExecutor:
        """Executor for ToMarkdown."""

        spec: ToMarkdown
        _converter: MarkItDown

        def prepare(self):
            client = OpenAI()
            self._converter = MarkItDown(llm_client=client, llm_model="gpt-4o")

        def __call__(self, content: bytes, filename: str) -> str:
            suffix = os.path.splitext(filename)[1]
            with tempfile.NamedTemporaryFile(delete=True, suffix=suffix) as temp_file:
                temp_file.write(content)
                temp_file.flush()
                text = self._converter.convert(temp_file.name).text_content
                return text
    ```

    Next we plug it into the data flow.

    ```python
        with data_scope["documents"].row() as doc:
            doc["markdown"] = doc["content"].transform(ToMarkdown(), filename=doc["filename"])
    ```

3.  Extract structured data from Markdown

    CocoIndex provides built-in functions (e.g. `ExtractByLlm`) that process data using LLMs. In this example, we use `gpt-4o` from OpenAI to extract structured data from the Markdown. CocoIndex also has built-in support for Ollama, which lets you run LLMs on your local machine.

    ```python
        with data_scope["documents"].row() as doc:
            doc["patient_info"] = doc["markdown"].transform(
                cocoindex.functions.ExtractByLlm(
                    llm_spec=cocoindex.LlmSpec(
                        api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"),
                    output_type=Patient,
                    instruction="Please extract patient information from the intake form."))
            patients_index.collect(
                filename=doc["filename"],
                patient_info=doc["patient_info"],
            )
    ```

    After the extraction, we pick the fields we want from the output by calling the `collect` method on the collector defined above.

4.  Export the extracted data to a table

    ```python
        patients_index.export(
            "patients",
            cocoindex.storages.Postgres(table_name="patients_info"),
            primary_key_fields=["filename"],
        )
    ```
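
The temp-file step in `ToMarkdownExecutor` (step 2) is there because the converter is called with a file path, and the original extension is kept, presumably so the converter can recognize the format. The sketch below isolates that pattern with the standard library only. `convert_via_temp_file` and `fake_convert` are hypothetical names used for illustration, and `fake_convert` stands in for the real `MarkItDown` call.

```python
import os
import tempfile
from typing import Callable


def convert_via_temp_file(
    content: bytes, filename: str, convert: Callable[[str], str]
) -> str:
    """Write `content` to a temp file that keeps the original extension,
    then return the result of `convert(path)`."""
    suffix = os.path.splitext(filename)[1]
    # A temporary directory is used so the file can be closed before the
    # converter reopens it by path; the directory is removed on exit.
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "input" + suffix)
        with open(path, "wb") as f:
            f.write(content)
        return convert(path)


def fake_convert(path: str) -> str:
    """Hypothetical converter: reports extension and size instead of parsing."""
    ext = os.path.splitext(path)[1]
    return f"{ext}:{os.path.getsize(path)}"


print(convert_via_temp_file(b"hello", "form.pdf", fake_convert))
```

Swapping `fake_convert` for a function that calls a real converter gives a quick way to test the conversion step on a single file, outside the flow.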

## Evaluate
🎉 Now you are all set with the extraction! For mission-critical use cases, it is important to evaluate the extraction quality. CocoIndex supports a simple way to do this: dump the extracted data to files and compare them against known-good (golden) files. Fancier evaluation methods exist, but we'll use this simple approach here.

1.  Dump the extracted data to YAML files.

    ```sh
    python3 main.py cocoindex evaluate
    ```

    This dumps what should be indexed to files under a directory. With my example data sources, the output looks like [the golden files](https://github.com/cocoindex-io/patient-intake-extraction/tree/main/data/eval_PatientIntakeExtraction_golden), with a timestamp in the directory name.

2.  Compare the extracted data with golden files

    We created a directory with golden files for each patient intake form. You can find them [here](https://github.com/cocoindex-io/patient-intake-extraction/tree/main/data/eval_PatientIntakeExtraction_golden).

    You can run the following command to see the diff:
    ```sh
    diff -r data/eval_PatientIntakeExtraction_golden data/eval_PatientIntakeExtraction_output
    ```

    I used a tool called [DirEqual](https://apps.apple.com/us/app/direqual/id1435575700) for Mac. We also recommend [Meld](https://meldmerge.org/) for Linux and Windows.

    DirEqual shows a directory-level diff. Double-click any row to see the file-level diff. In my case, a `condition` is missing for the `Patient_Intake_Form_Joe.pdf` file.
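
If you prefer to run the comparison from Python, for example in a test, the recursive diff can be sketched with the standard `filecmp` module. `diff_dirs` is a hypothetical helper written for this post, not part of CocoIndex. It compares file contents byte by byte, so it reports whether files differ, not where.

```python
import filecmp
import os


def diff_dirs(golden: str, output: str, prefix: str = "") -> list[str]:
    """Recursively compare two directories and describe the differences."""
    cmp = filecmp.dircmp(golden, output)
    report = [f"missing in output: {prefix}{n}" for n in sorted(cmp.left_only)]
    report += [f"unexpected in output: {prefix}{n}" for n in sorted(cmp.right_only)]
    # shallow=False forces a content comparison instead of trusting
    # matching os.stat() signatures (size and modification time).
    _, mismatch, errors = filecmp.cmpfiles(
        golden, output, cmp.common_files, shallow=False
    )
    report += [f"differs: {prefix}{n}" for n in sorted(mismatch + errors)]
    # Descend into subdirectories that exist on both sides.
    for name in sorted(cmp.common_dirs):
        report += diff_dirs(
            os.path.join(golden, name),
            os.path.join(output, name),
            f"{prefix}{name}/",
        )
    return report
```

Calling `diff_dirs("data/eval_PatientIntakeExtraction_golden", "data/eval_PatientIntakeExtraction_output")` returns an empty list when the two trees match, which makes it easy to assert on.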

### Troubleshooting

The original intake form for this record is [this one](https://github.com/cocoindex-io/patient-intake-extraction/blob/main/data/example_forms/Patient_Intake_Form_Joe_Artificial.pdf).

We will troubleshoot in two steps:
1. Convert to Markdown
2. Extract structured data from Markdown

We'll use CocoInsight to troubleshoot this issue. Start the CocoIndex server:

```sh
cocoindex server -ci main
```

Then go to https://cocoindex.io/cocoinsight, where you can explore the data in an interactive UI.

Click on the `markdown` column for `Patient_Intake_Form_Joe.pdf` to see the Markdown content.

In this case, the converted Markdown is not well understood by the LLM extraction step. From here, we can iterate by trying different models for the Markdown converter or for the extraction LLM and checking whether the results improve, or fall back to manual correction.
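
To decide which of the two steps lost the information, one option is to check whether the values you expect from the form still appear in the converted Markdown. The sketch below does this with a hypothetical `missing_terms` helper and made-up sample text; it is not part of CocoIndex.

```python
def missing_terms(markdown: str, expected_terms: list[str]) -> list[str]:
    """Return the expected terms that do not appear in the Markdown.

    Matching is case-insensitive and ignores differences in whitespace.
    """
    haystack = " ".join(markdown.lower().split())
    return [
        term
        for term in expected_terms
        if " ".join(term.lower().split()) not in haystack
    ]


# Made-up Markdown, standing in for the content of the `markdown` column.
sample_markdown = """
## Medical History

| Condition | Diagnosed |
| Asthma    | Yes       |
"""

print(missing_terms(sample_markdown, ["Asthma", "Type 2 Diabetes"]))
```

If a term is missing from the Markdown, the conversion step dropped it and a different converter or model is worth trying. If the term is present but absent from the extracted record, the extraction step is the one to look at.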

## Query the extracted data

Run the following commands to set up and update the index.
```sh
cocoindex setup main
cocoindex update main
```
You'll see the index update status in the terminal.

After the index is built, you have a table with the name `patients_info`. You can query it at any time, e.g., start a Postgres shell:

```sh
psql postgres://cocoindex:cocoindex@localhost/cocoindex
```

Then run:

```sql
select * from patients_info;
```

You should see the rows of the `patients_info` table.

You can also use CocoInsight, mentioned above, as a debugging tool to explore the data.

## Support us
We are constantly improving, and more features and examples are coming soon. If this tutorial is helpful, please give us a star ⭐ at [GitHub](https://github.com/cocoindex-io/cocoindex) to help us grow.
