---
title: "Adaptive Batching - 5x throughput on your data pipelines"
description: "CocoIndex now batches GPU and ML workloads automatically: 5x throughput on text embeddings and AI ops, with zero configuration required."
last_updated: 2025-11-10
doc_version: "2025-11-10"
canonical: https://cocoindex.io/blogs/batching/
---
# Adaptive Batching - 5x throughput on your data pipelines

> CocoIndex now batches GPU and ML workloads automatically: 5x throughput on text embeddings and AI ops, with zero configuration required.

Published: 2025-11-10 · Canonical: https://cocoindex.io/blogs/batching/

[CocoIndex](https://github.com/cocoindex-io/cocoindex) just launched batching support for CocoIndex functions.
With batching, throughput increased to ~5X the non-batched baseline (≈80% lower runtime) when embedding the CocoIndex codebase using the *sentence-transformers/all-MiniLM-L6-v2* model.

## Why batching makes processing fast

When we call a function or remote API, every call has:

- **Fixed overhead** you pay once per call: GPU kernel launch setup, Python/C transition, scheduling, memory allocator work, framework bookkeeping, etc.

- **Data-size-dependent work** that scales with input: floating-point operations (FLOPs) for the model, bytes copied, tokens processed.

Doing 1,000 items one by one pays the fixed overhead 1,000 times.
Doing them in batches pays that overhead once per batch, and the hardware can run the size-dependent work more efficiently (fewer launches, better cache use, fuller pipelines).
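
A minimal cost model makes the arithmetic concrete. The sketch below assumes every call costs a fixed overhead plus a per-item cost; the function name `total_time_ms` and the millisecond figures are illustrative assumptions, not measurements. It deliberately ignores the second effect (the hardware running larger batches more efficiently), so real gains can be larger.

```python
import math


def total_time_ms(n_items: int, batch_size: int,
                  overhead_ms: float, per_item_ms: float) -> float:
    """Estimated time to process n_items in calls of up to batch_size items."""
    n_calls = math.ceil(n_items / batch_size)
    # Fixed overhead is paid once per call; the work is paid once per item.
    return n_calls * overhead_ms + n_items * per_item_ms


# Hypothetical costs: 10 ms of overhead per call, 1 ms of work per item.
one_by_one = total_time_ms(1000, 1, overhead_ms=10.0, per_item_ms=1.0)
batched = total_time_ms(1000, 32, overhead_ms=10.0, per_item_ms=1.0)
print(one_by_one, batched)  # 11000.0 1320.0
```

With these made-up numbers the work itself takes 1,000 ms either way; only the overhead shrinks, from 10,000 ms to 320 ms.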

**Batching helps because it:**

- **Spreads one-time overhead across many items.** You do fewer GPU kernel launches and fewer Python-to-C boundary crossings, etc.

- **Lets the GPU run bigger, more efficient matrix math.** Larger batches map to dense matrix multiplies (*General Matrix–Matrix Multiplication*, a.k.a. GEMM), which use the hardware more effectively (higher utilization).

- **Cuts down on data copies.** You reduce transfers between CPU memory and GPU memory, H2D (*Host-to-Device*) and D2H (*Device-to-Host*), so more time is spent computing, not moving bytes.

## What batching looks like for normal Python code

### Non-batching code – simple but less efficient

The most natural way to organize a pipeline is to process data piece-by-piece.
For example, a two-layer loop like this:

```python
from pathlib import Path

# `directory`, `split_into_chunks`, `model`, and `index` are placeholders.
for path in Path(directory).iterdir():
    content = path.read_text()
    chunks = split_into_chunks(content)
    for chunk in chunks:
        vector = model.encode([chunk.text])[0]      # one item at a time
        index.upsert(file_id=path.name, chunk_offset=chunk.offset, vector=vector)
```

This is easy to read and reason about: each chunk flows straight through multiple steps.

### Batching manually – more efficient but complicated

You can speed it up by batching, but even the simplest “just batch everything once” version makes the code significantly more complicated:

```python
from pathlib import Path

# 1) Collect payloads and remember where each came from
batch_texts = []
metadata = []  # (file_name, chunk_offset)
for path in Path(directory).iterdir():
    content = path.read_text()
    chunks = split_into_chunks(content)
    for chunk in chunks:
        batch_texts.append(chunk.text)
        metadata.append((path.name, chunk.offset))

# 2) One batched call (library will still mini-batch internally)
vectors = model.encode(batch_texts)

# 3) Zip results back to their sources
for (file_name, chunk_offset), vector in zip(metadata, vectors):
    index.upsert(file_id=file_name, chunk_offset=chunk_offset, vector=vector)
```

Moreover, batching everything at once is usually not ideal, e.g., the next steps can only start after this step is done for all data.
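
One common middle ground is to flush fixed-size batches as you go, so downstream steps can start before all input has been read. The helper below is a generic sketch with a hypothetical name (`iter_batches`); it is not part of CocoIndex, and it still leaves you choosing a batch size and carrying metadata by hand.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of up to `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    # islice pulls at most `size` items per round; an empty list ends the loop.
    while batch := list(islice(it, size)):
        yield batch


batches = list(iter_batches(range(7), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```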

## CocoIndex's batching support

CocoIndex bridges the gap and gives you the best of both: your code keeps the simplicity of the natural flow, while the CocoIndex runtime provides the efficiency of batching.

Batching is already enabled for the following [built-in functions](https://cocoindex.io/docs/programming_guide/core_concepts/):

- *EmbedText*
- *SentenceTransformerEmbed*
- *ColPaliEmbedImage*
- *ColPaliEmbedQuery*

Batching doesn’t change the API. **Your existing code works without any change: it still follows the natural flow while getting the efficiency of batching.**

For [custom functions](https://cocoindex.io/docs/programming_guide/function/), enabling batching is as simple as:

- Set `batching=True` in the custom function decorator.
- Change the arguments and return type to `list`.

For example, suppose you want to create a custom function that calls an API to build thumbnails for images.

```python
@cocoindex.op.function(batching=True)
def make_image_thumbnail(args: list[bytes]) -> list[bytes]:
    # Receives a batch of images and returns one thumbnail per input, in the same order.
    ...
```

See the [batching documentation](https://cocoindex.io/docs/advanced_topics/concurrency_control/) for more details.

## How CocoIndex batches

### Common approaches

To batch requests, you first accumulate them in a queue, then decide when to flush that queue as a batch. Two widely used policies are:

- **Time-based (W ms)**: Flush whatever arrived in the last *W* milliseconds.
  - *Pros:* predictable wait bound; simple.
  - *Cons:* adds idle latency when traffic is sparse; needs tuning per workload.

- **Size-based (K items)**: Flush when at least *K* items are waiting.
  - *Pros:* predictable batch size; easy to reason about memory.
  - *Cons:* under sparse traffic, the head request can wait too long; still needs tuning.

Many systems combine them (“flush when **W** or **K** triggers first”), which helps, but you’re still tuning knobs and making trade-offs that shift with traffic.
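
To see the trade-off, the combined policy can be simulated on a list of arrival times. This is a simplified sketch: `plan_batches` is a hypothetical helper, the timestamps are made up, and a real batcher would use timers rather than inspecting the next arrival.

```python
def plan_batches(arrivals_ms: list[float], window_ms: float,
                 max_items: int) -> list[list[float]]:
    """Group sorted arrival times into batches under a combined policy.

    A batch is flushed when it holds `max_items` requests or when `window_ms`
    has passed since its first request, whichever comes first.
    """
    batches: list[list[float]] = []
    current: list[float] = []
    for t in arrivals_ms:
        # The window expired before this arrival: flush what was waiting.
        if current and t - current[0] >= window_ms:
            batches.append(current)
            current = []
        current.append(t)
        # The size limit was reached: flush right away.
        if len(current) >= max_items:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# Hypothetical arrivals (ms): a burst, then sparse traffic.
plan = plan_batches([0, 1, 2, 3, 50, 51, 200], window_ms=10, max_items=3)
print(plan)  # [[0, 1, 2], [3], [50, 51], [200]]
```

In this run, the request arriving at 3 ms sits alone for the full 10 ms window before it is flushed, which is the idle latency that a time-based policy adds under sparse traffic.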

### CocoIndex’s approach

#### Framework level: adaptive, knob-free

CocoIndex keeps batching simple:

- While a batch is running on the device, new requests keep queuing.

- When that batch finishes, CocoIndex ships all currently queued requests as the next batch window and immediately starts processing again.

- No timers. No target batch size. The batch naturally reflects whatever arrived during the previous service time.

Why is this good?

- **Low latency when sparse**: With few requests, batches are tiny (often size 1), so you’re effectively running at near single-call latency.

- **High throughput when busy**: When traffic spikes, more requests accumulate during the in-flight batch, so the next batch is larger. Utilization rises automatically.

- **No tuning**: You don’t need to tune *W* or *K*. The system adapts to your traffic pattern by design.
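
The queue-and-flush loop described in this section can be sketched in a few lines of `asyncio`. This illustrates the idea only and is not CocoIndex's actual implementation; `AdaptiveBatcher`, `double_all`, and the sleep durations are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable

BatchFn = Callable[[list[Any]], Awaitable[list[Any]]]


class AdaptiveBatcher:
    """Run one batch at a time; whatever queued up meanwhile is the next batch."""

    def __init__(self, batch_fn: BatchFn) -> None:
        self._batch_fn = batch_fn
        self._pending: list[tuple[Any, asyncio.Future]] = []
        self._running = False
        self.batch_sizes: list[int] = []  # recorded for inspection only

    async def submit(self, item: Any) -> Any:
        fut = asyncio.get_running_loop().create_future()
        self._pending.append((item, fut))
        if not self._running:  # idle: start draining right away
            self._running = True
            self._drain_task = asyncio.create_task(self._drain())
        return await fut

    async def _drain(self) -> None:
        try:
            while self._pending:
                # Take everything queued so far: no timer, no target size.
                batch, self._pending = self._pending, []
                self.batch_sizes.append(len(batch))
                try:
                    results = await self._batch_fn([item for item, _ in batch])
                    for (_, fut), result in zip(batch, results):
                        if not fut.done():
                            fut.set_result(result)
                except Exception as exc:
                    for _, fut in batch:
                        if not fut.done():
                            fut.set_exception(exc)
        finally:
            self._running = False


async def demo() -> tuple[list[int], list[int]]:
    async def double_all(items: list[int]) -> list[int]:
        await asyncio.sleep(0.05)  # stand-in for model/device time
        return [x * 2 for x in items]

    batcher = AdaptiveBatcher(double_all)
    first = asyncio.create_task(batcher.submit(1))
    await asyncio.sleep(0.01)  # the first batch is now in flight
    rest = [asyncio.create_task(batcher.submit(i)) for i in (2, 3, 4)]
    results = await asyncio.gather(first, *rest)
    return results, batcher.batch_sizes
```

Running `asyncio.run(demo())` sends the first request alone (batch size 1); the three requests that arrive while it is in flight form the next batch (size 3).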

#### Function level: pack the batch intelligently

Each function receives the batch window (all queued requests at that moment) and decides how to process it efficiently and safely for its model/library.

We use the *SentenceTransformerEmbed* function as an example.
The underlying *sentence-transformers* library accepts batches of arbitrary length, but splits them into micro-batches (default size: 32) so that each fits into device memory and keeps kernels in the model's sweet spot.
We use the default micro-batch size.

In addition, transformer runtimes pad every sequence in a batch to the length of the longest one so the GPU can run uniform, fast kernels. As a result, short texts pay the cost of the longest text in the batch (e.g., mixing 64-token with 256-token items makes the 64-token ones ~4X more expensive).
We sort by token count and form micro-batches of similar lengths to keep padding minimal and throughput high.
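
The effect of sorting on padding can be checked with a small calculation. The token counts below are hypothetical and the helper names (`padded_tokens`, `split`) are ours; the sketch only counts padded tokens and does not call any model.

```python
def padded_tokens(micro_batches: list[list[int]]) -> int:
    """Tokens computed when each micro-batch is padded to its longest item."""
    return sum(max(batch) * len(batch) for batch in micro_batches if batch)


def split(lengths: list[int], size: int) -> list[list[int]]:
    """Cut a list of token counts into consecutive micro-batches."""
    return [lengths[i:i + size] for i in range(0, len(lengths), size)]


# Hypothetical token counts for eight chunks, in arrival order.
lengths = [64, 256, 64, 256, 64, 256, 64, 256]

arrival_order = split(lengths, 4)
length_sorted = split(sorted(lengths), 4)

print(padded_tokens(arrival_order))  # 2048
print(padded_tokens(length_sorted))  # 1280
```

Here the sorted layout computes exactly the 1,280 real tokens, while the arrival-order layout pads up to 2,048. An actual implementation also has to put the resulting vectors back into the original input order after encoding.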

Other functions can simply ship the entire batch to the backend, or apply their own packing (e.g., SIMD tiles, merge-writes).
The framework is agnostic.
It just delivers the batch window promptly.

## Performance evaluation

We ran all benchmarks on a MacBook Pro (Apple M1 Pro, 16 GB unified memory, 2022). For each configuration:

- We compared two versions: batching **on** (cocoindex v0.3.1) and batching **off** (cocoindex v0.2.23).

- We report two timings: (1) end-to-end benchmark wall-clock time, and (2) the cumulative time inside the embedding function (where batching happens).

- We executed each test 5 times, discarded the fastest and slowest run to dampen outliers, and reported the mean of the remaining three runs (a 20% trimmed mean).
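
The aggregation in the last step is simple enough to state as code. The sketch below uses a hypothetical helper name and made-up timings; it is not the benchmark script itself.

```python
from statistics import mean


def trimmed_mean(runs: list[float]) -> float:
    """Drop the single fastest and slowest run, then average the rest."""
    if len(runs) < 3:
        raise ValueError("need at least 3 runs")
    return mean(sorted(runs)[1:-1])


# Hypothetical timings in seconds (not taken from the benchmark).
print(trimmed_mean([12.4, 12.6, 12.5, 15.0, 11.0]))
```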

### *text_embedding* and *code_embedding*

We ran two basic examples from the CocoIndex repository:

- [*text_embedding*](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding): with 3 input files, 106 chunks in total.

- [*code_embedding*](https://github.com/cocoindex-io/cocoindex/tree/main/examples/code_embedding): with 273 input files, 3383 chunks in total.

Both use the *SentenceTransformerEmbed* function with the *all-MiniLM-L6-v2* embedding model (22.7M parameters).

This is the evaluation outcome:

<small>
  <table>
    <thead>
      <tr>
        <th rowspan="2">example name</th>
        <th colspan="3">end-to-end execution time</th>
        <th colspan="3">runtime in function to batch</th>
      </tr>
      <tr>
        <th>off (s)</th> <th>on (s)</th> <th>saving</th> <th>off (s)</th> <th>on (s)</th> <th>saving</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td><em>text_embedding</em></td> <td>1.959</td> <td>0.630</td> <td>67.84%</td> <td>1.853</td> <td>0.567</td> <td>69.42%</td>
      </tr>
      <tr>
        <td><em>code_embedding</em></td> <td>58.931</td> <td>12.516</td> <td>78.76%</td> <td>58.343</td> <td>12.117</td> <td>79.23%</td>
      </tr>
    </tbody>
  </table>
</small>

*code_embedding* has far more chunks, which gives more opportunities for batching, so its runtime saving is larger.
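
For reference, the "saving" columns are consistent with `1 - on / off`, and the equivalent speedup is `off / on`. The snippet below assumes that definition (it reproduces the end-to-end figures in the table above) and uses helper names of our own.

```python
def saving_pct(off_s: float, on_s: float) -> float:
    """Percentage of runtime removed by turning batching on."""
    return (1 - on_s / off_s) * 100


def speedup(off_s: float, on_s: float) -> float:
    """How many times faster the batched run is."""
    return off_s / on_s


# End-to-end timings (seconds) from the table above.
timings = [("text_embedding", 1.959, 0.630), ("code_embedding", 58.931, 12.516)]
for name, off_s, on_s in timings:
    print(f"{name}: {saving_pct(off_s, on_s):.2f}% saving, "
          f"{speedup(off_s, on_s):.2f}x faster")
# text_embedding: 67.84% saving, 3.11x faster
# code_embedding: 78.76% saving, 4.71x faster
```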

### *code_embedding* on different micro-batch sizes

We experimented with different micro-batch sizes in *sentence-transformers*:

<small>
  <table>
    <thead>
      <tr>
        <th rowspan="2">micro batch size</th>
        <th colspan="2">end-to-end execution time</th>
        <th colspan="2">runtime in function to batch</th>
      </tr>
      <tr>
        <th>runtime (s)</th> <th>saving</th> <th>runtime (s)</th> <th>saving</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>batch off</td> <td>58.931</td> <td>N/A</td> <td>58.343</td> <td>N/A</td>
      </tr>
      <tr>
        <td>4</td> <td>23.254</td> <td>60.54%</td> <td>23.083</td> <td>60.44%</td>
      </tr>
      <tr>
        <td>8</td> <td>16.522</td> <td>71.96%</td> <td>16.210</td> <td>72.22%</td>
      </tr>
      <tr>
        <td>16</td> <td>12.812</td> <td>78.26%</td> <td>12.640</td> <td>78.34%</td>
      </tr>
      <tr>
        <td>32 (default)</td> <td>12.516</td> <td>78.76%</td> <td>12.117</td> <td>79.23%</td>
      </tr>
      <tr>
        <td>64</td> <td>11.925</td> <td>79.76%</td> <td>11.577</td> <td>80.16%</td>
      </tr>
      <tr>
        <td>128</td> <td>11.939</td> <td>79.74%</td> <td>11.597</td> <td>80.12%</td>
      </tr>
    </tbody>
  </table>
</small>

There are significant improvements when we increase the batch size to 4, 8, and 16.
As the batch size increases, the improvements become smaller and smaller, since the amortized fixed overhead is already low.

Although micro-batch sizes of 64 and 128 perform slightly better, we keep using the default micro-batch size offered by the *sentence-transformers* library (currently 32), as we trust it will provide and maintain a reasonable default.

### *code_embedding* on *nomic-embed-text-v1.5* model

In this experiment, we switch from *all-MiniLM-L6-v2* to *nomic-embed-text-v1.5*, a larger model with 0.1B parameters:

<small>
  <table>
    <thead>
      <tr>
        <th colspan="3">end-to-end execution time</th>
        <th colspan="3">runtime in function to batch</th>
      </tr>
      <tr>
        <th>off (s)</th> <th>on (s)</th> <th>saving</th> <th>off (s)</th> <th>on (s)</th> <th>saving</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>137.170</td> <td>131.442</td> <td>4.18%</td> <td>136.607</td> <td>130.377</td> <td>4.56%</td>
      </tr>
    </tbody>
  </table>
</small>

The improvement is much smaller compared to the *all-MiniLM-L6-v2* model.
This is because once the model is large, the fixed overhead is usually much less than the data-size-dependent work.
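
A rough way to see this: if batching can only remove fixed overhead and never the work itself, the saving is bounded by the overhead's share of per-item time. The sketch below encodes that simplified assumption; the function name and the millisecond figures are hypothetical and chosen only to show the trend.

```python
def max_saving(fixed_ms: float, work_ms: float) -> float:
    """Upper bound on the fraction of per-item time that batching can remove.

    Assumes batching eliminates only fixed overhead, never the work itself.
    """
    return fixed_ms / (fixed_ms + work_ms)


# Hypothetical per-item costs: same overhead, 16x more work for the larger model.
small_model = max_saving(fixed_ms=8.0, work_ms=2.0)
large_model = max_saving(fixed_ms=8.0, work_ms=32.0)
print(f"{small_model:.0%} vs {large_model:.0%}")  # 80% vs 20%
```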

### *code_embedding* on Ollama with *all-minilm* model

In this experiment, we switched from *SentenceTransformerEmbed* to the *EmbedText* function, which embeds text through a remote API.
We picked the *all-minilm* model, which is the same as the default model we used with *SentenceTransformerEmbed* above.

For ease of comparison, we copy the data from running *SentenceTransformerEmbed* on the same model from above:

<small>
  <table>
    <thead>
      <tr>
        <th rowspan="2">function for embedding</th>
        <th colspan="3">end-to-end execution time</th>
      </tr>
      <tr>
        <th>off (s)</th> <th>on (s)</th> <th>saving</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td><em>SentenceTransformerEmbed</em></td> <td>58.931</td> <td>12.516</td> <td>78.76%</td>
      </tr>
      <tr>
        <td><em>EmbedText</em> with Ollama</td> <td>44.248</td> <td>38.343</td> <td>13.35%</td>
      </tr>
    </tbody>
  </table>
</small>

Without batching, calling [Ollama](https://cocoindex.io/blogs/cocoindex-ollama-structured-extraction-from-pdf) is faster than using *SentenceTransformerEmbed* (44.248 s vs. 58.931 s).
With batching, however, the saving for Ollama is small (13.35%), so it ends up roughly 3X slower than *SentenceTransformerEmbed* (38.343 s vs. 12.516 s).

After digging deeper, we noticed that Ollama computes embeddings for different inputs separately ([Ollama code](https://github.com/ollama/ollama/blob/392a270261dfb1d1cee1de3713836b503a7526ce/server/routes.go#L730-L744)), even if the inputs come from the same request.
So batching doesn't save much for Ollama – it still saves a little bit, likely because of the reduced HTTP API calls.

## Conclusion

In conclusion, batching significantly enhances processing speed by amortizing fixed overhead across multiple items, enabling more efficient GPU operations, and reducing data transfer.
CocoIndex simplifies this by offering automatic batching for several built-in functions and an easy `batching=True` decorator for custom functions.

The greatest impact of batching is seen when fixed overhead constitutes a larger portion of the total work, such as with smaller models.
It's also most effective when the underlying API or library fully supports batched operations, as demonstrated by the limited gains observed with Ollama.

## Support us

⭐ Star [CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex) and share with your community if you find it useful!
