
EP Context Design

ONNX Runtime Execution Providers (EPs) let users run ONNX models on a variety of hardware accelerators by leveraging backend SDKs (e.g., QNN, OpenVINO, Vitis AI).
An execution provider converts the ONNX model into the graph format required by the backend SDK and compiles it into the format required by the hardware.
In the NPU domain, this conversion and compilation process can be time-consuming; for LLM models in particular it can take tens of minutes to complete, which significantly hurts the user experience at session creation time.
To eliminate this repeated conversion and compilation overhead, most backend SDKs provide a way to dump the pre-compiled model into a binary file.

  • The pre-compiled model can be loaded directly by the backend SDK and executed on the target device.
  • This greatly reduces session creation time and improves overall efficiency.

To support this optimization, ONNX Runtime introduced a contrib operator in the MS domain called EPContext.

Operator domain: com.microsoft
Node inputs & outputs: variadic

EPContext node attributes:

| Attribute | Data type | Description |
|---|---|---|
| main_context | int64 | 1 (default): the node points to EP context content that contains the graph associated with this node.<br>0: the node does not point to EP context content; instead, it expects to get its graph from another node whose main_context is set to 1.<br>Some EPs support a single context that contains multiple graphs. The EPContext node with main_context = 1 points to the primary context; that context can contain multiple graphs, which other nodes with main_context = 0 can refer to. |
| ep_cache_context | string | The payload of the EP context if embed_mode = 1, or the path to the context binary file if embed_mode = 0.<br>The path is relative to the ONNX model file and can be either a file name or subfolder/file_name. |
| embed_mode | int64 | 1 (default): ep_cache_context contains the payload of the context content.<br>0: ep_cache_context contains the file path to the context binary. |
| ep_sdk_version | string | Optional. The SDK version used to generate this node. |
| onnx_model_filename | string | Optional. The original ONNX model file name. |
| hardware_architecture | string | Optional. The hardware architecture. |
| partition_name | string | Optional. The OnnxRuntime partitioned graph name. |
| source | string | Optional. The source identifier used to generate this node.<br>It should be a unique key defined by the EP, so that ONNX Runtime can support multiple EPContext nodes that run with different EPs.<br>Examples:<br>The QNN EP only accepts nodes with source = QNN or QnnExecutionProvider.<br>The OpenVINO EP only accepts nodes with source = OpenVINOExecutionProvider. |
| notes | string | Optional. Additional information required by a specific EP. |
| max_size | int64 | Optional. The maximum size within the context; its usage is EP-dependent.<br>Defaults to 0. |

EP Context node example
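As a rough illustration of the attribute table, the struct below is a hypothetical stand-in that mirrors the EPContext attributes and their defaults; it is not an ONNX Runtime type, and `PointsToExternalBinary` is an invented helper name.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in mirroring the EPContext node attributes listed in
// the table above (NOT an ONNX Runtime type); field defaults match the
// documented attribute defaults.
struct EPContextNodeAttrs {
  int64_t main_context = 1;          // 1: points to the primary EP context
  std::string ep_cache_context;      // payload (embed_mode=1) or relative file path (embed_mode=0)
  int64_t embed_mode = 1;            // 1: payload embedded; 0: path to the context binary
  std::string ep_sdk_version;        // optional
  std::string onnx_model_filename;   // optional
  std::string hardware_architecture; // optional
  std::string partition_name;        // optional
  std::string source;                // optional, e.g. "QnnExecutionProvider"
  std::string notes;                 // optional
  int64_t max_size = 0;              // optional, usage is EP-dependent
};

// With embed_mode = 0, ep_cache_context names a file next to the ONNX model.
bool PointsToExternalBinary(const EPContextNodeAttrs& n) {
  return n.embed_mode == 0;
}
```

For example, a node dumped with ep.context_embed_mode = 0 would carry embed_mode = 0 and a relative path such as `model1_qnn.bin` in ep_cache_context.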

OnnxRuntime Session Options Related to EP Context Cache Generation and Inference

| Session option | Description |
|---|---|
| ep.context_enable | Used for EP context model generation only.<br>1: enables ONNX Runtime to dump the context cache model.<br>0 (default): disables context model dumping. |
| ep.context_file_path | Specifies the dumped model file path.<br>Default: original_file_name_ctx.onnx for context model generation.<br>For model inference: if the user loads the model from a memory buffer and the EP context binary is outside the ONNX model, this option must be set. The ONNX Runtime EP uses this path to determine the folder location and combines it with ep_cache_context (which points to the context binary path) to construct the absolute path to the context binary file. |
| ep.context_embed_mode | Used for context model generation only.<br>1: dumps the EP context content directly into the ONNX model, stored inside the ep_cache_context node attribute.<br>0 (default): dumps the EP context content into a separate file and stores the file name in the ONNX model; the file path is tracked in the ep_cache_context node attribute. |
| ep.context_node_name_prefix | Used for context model generation only.<br>Specifies a prefix for EPContext node names (also used for the partition_name attribute and internal graph names).<br>This guarantees uniqueness across nodes when multiple EPContext nodes are combined into a single model, preventing name conflicts. The EP may also apply this prefix to the ep_graph names inside the compiled EP context binary. |
| session.model_external_initializers_file_folder_path | Not specific to the EPContext design. In general, for a model with external data, when the model is loaded from a memory buffer the session loses track of the model's name and path and cannot locate the external data files. Use this option to specify the folder path of the external data files.<br>All external data files must be placed in the same folder. |
| ep.context_model_external_initializers_file_name | Used for context model generation only.<br>This option applies when some nodes are partitioned onto the CPU EP and those nodes have external initializers. When generating the EP context model, the new model should not depend on the old external data file used by the source ONNX model.<br>Use this option to dump the EP context model with an external initializers file.<br>If specified, all initializers are placed in that external data file; otherwise, all initializers are embedded in the generated ONNX file.<br>By default this option is not set, meaning all initializers are embedded in the ONNX file. |

EP Context Cache Model Generation Workflow


EP Interface GetEpContextNodes() for Generating the EP Context Cache Model


Generating the partitioned graph directly within the Execution Provider (EP) code is challenging, as the EP lacks a complete view of the entire partitioned graph. To address this, ONNX Runtime introduces a new Execution Provider interface: GetEpContextNodes().

virtual const InlinedVector<const Node*> GetEpContextNodes() const {
  return InlinedVector<const Node*>();
}
  • This API returns an array of pointers to EPContext nodes.
  • Execution Providers should implement this interface if they need to generate the context cache model. Otherwise, they can leave it unimplemented.
  • It is the EP’s responsibility to create the EPContext nodes along with their dependencies (e.g., the context binary file if embed_mode = 0).
  • The ONNX Runtime GraphPartitioner uses this interface to retrieve the EPContext nodes and generate the partitioned ONNX model. See the EP context model generation code in the ONNX Runtime GraphPartitioner for details.
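The ownership and lifetime rules above can be sketched as follows. Node, InlinedVector, MyExecutionProvider, and Compile are simplified stand-ins for the internal ONNX Runtime types and interfaces, used only to illustrate the pattern of retaining EPContext nodes until the EP is destroyed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-ins for ONNX Runtime's internal types (illustrative only).
struct Node { std::string name; };
template <typename T> using InlinedVector = std::vector<T>;

// Sketch of an EP that creates EPContext nodes during compilation and keeps
// them alive so GetEpContextNodes() can hand them to the GraphPartitioner.
class MyExecutionProvider {
 public:
  // Called during compilation: the EP creates one EPContext node per
  // compiled partition and retains ownership of the node objects.
  void Compile(const std::vector<std::string>& partition_names) {
    for (const auto& p : partition_names)
      ep_context_nodes_.push_back(Node{p + "_ctx"});
  }

  // The interface from the text: return pointers to the retained nodes.
  InlinedVector<const Node*> GetEpContextNodes() const {
    InlinedVector<const Node*> out;
    for (const auto& n : ep_context_nodes_) out.push_back(&n);
    return out;
  }

 private:
  // The nodes must outlive Compile(): they live until the EP is destroyed.
  std::vector<Node> ep_context_nodes_;
};
```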

EP Context Cache Model Generation Guidelines


OnnxRuntime EPs should adhere to the following guidelines to create the EP context cache model and maintain a unified user interface:

  • Ownership

    • The Execution Provider (EP) is responsible for creating the EPContext node along with its dependencies.
    • The ONNX Runtime framework is responsible for generating the EP context ONNX model using the EPContext node list provided by the EP.
  • Lifetime

    • The lifetime of EPContext nodes begins no later than the EP's Compile call and ends when the EP is destroyed.
  • ep.context_enable

    • ONNX Runtime creates the EP context cache model if ep.context_enable = 1.
    • Otherwise, if ep.context_enable = 0 (default), ONNX Runtime follows the standard workflow without generating a cache model.
  • ep.context_file_path

    • If ep.context_file_path is not provided, ONNX Runtime generates the output model file name by replacing .onnx in the original input model file name with _ctx.onnx.
    • If ep.context_file_path is specified, ONNX Runtime uses the provided file path. The EP should also use this path to determine the folder location for dumping the compiled EP context binary file when ep.context_embed_mode = 0.
    • Note: ep.context_file_path is required when loading the model from a memory buffer, as ONNX Runtime cannot retrieve the original model file path in this scenario.
  • ep.context_embed_mode

    • 1: Embeds the EP context content directly into the ONNX model.
    • 0 (default): Dumps the EP context content into a separate file (EP context binary file).
      • There should be a single EP context binary, even if multiple partitioned subgraphs exist. If the EP cannot achieve this in the short term, please note it on the EP webpage. In such cases, users will need to determine the necessary files for production deployment by iterating through all primary EPContext nodes (nodes with main_context = 1) and extracting the file paths from the node attribute ep_cache_context.
      • The EP context binary file name should be [model_name]_[ep].bin.
      • The EP records the context binary file name in the EPContext node attribute ep_cache_context.
      • The context binary file must be located in the same directory as the dumped ONNX model file.
      • The file path recorded in the EPContext node is a relative path to the ONNX model file.
      • Note: Subfolders are allowed.
  • ep.context_node_name_prefix

    • If the user wants to add a custom prefix to the EPContext node name (also applied to the partition_name attribute and graph name), the EP should provide this capability when generating EPContext nodes.
    • This is useful when combining multiple EPContext nodes from different models into a single model, where there is a risk of node name or graph name conflicts across models.
    • The EP should support multiple EP contexts within a single model, enabling users to merge and interconnect EPContext nodes generated from different models.
  • Source model with external data
    When the source model relies on an external data file, ONNX uses a relative path to locate that file. Therefore, the external data file must reside in the same directory as the source model. However, newly generated models should not depend on any original source files. This approach is driven by several considerations:

    • All newly generated files should be located in the same directory.
    • There’s no guarantee that the output files will be generated in the same directory as the source files.
    • The EPContext design allows a model to be partitioned by multiple EPs, each compiling its own EPContext nodes. A unified and standardized process helps avoid data duplication.
    • Some EPs may need to copy weights from the source into their context binaries to satisfy specific data layout requirements.
    • For subgraphs that fall back to the ONNX Runtime CPU EP, all weight data will, by default, be embedded directly into the newly generated [model_name]_ctx.onnx model. If ep.context_model_external_initializers_file_name is set, then all weight data will instead be saved to the specified external initializers file.
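The file-naming conventions in these guidelines can be sketched with two small helpers; DefaultContextModelPath and ContextBinaryName are hypothetical names for illustration, not ONNX Runtime APIs.

```cpp
#include <cassert>
#include <string>

// If ep.context_file_path is not set, the dumped model name is the source
// model name with ".onnx" replaced by "_ctx.onnx".
std::string DefaultContextModelPath(const std::string& source_model_path) {
  const std::string suffix = ".onnx";
  std::string out = source_model_path;
  if (out.size() >= suffix.size() &&
      out.compare(out.size() - suffix.size(), suffix.size(), suffix) == 0)
    out.insert(out.size() - suffix.size(), "_ctx");  // "model1.onnx" -> "model1_ctx.onnx"
  return out;
}

// With embed_mode = 0, the context binary should be named [model_name]_[ep].bin.
std::string ContextBinaryName(const std::string& model_name, const std::string& ep) {
  return model_name + "_" + ep + ".bin";
}
```

For instance, dumping `model1.onnx` with the QNN EP would, under these conventions, produce `model1_ctx.onnx` plus a `model1_qnn.bin` context binary in the same folder.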

Generate the EPContext model by creating session from model path:

Ort::SessionOptions so;
// Enable EPContext ONNX model dumping
so.AddConfigEntry(kOrtSessionOptionEpContextEnable, "1");
// Add the execution provider (using QNN as an example)
so.AppendExecutionProvider("QNN", provider_options);
// Create the session to dump the `_ctx.onnx` model
Ort::Session session1(env, "./model1.onnx", so);

Generate the EPContext model by creating session from model in memory buffer:
Similar to the C API CreateSessionFromArray, the example below creates an ONNX Runtime session from a model stored in a memory array, causing the session to lose track of the model’s name and path. To generate the EPContext model, you must specify the file path using: ep.context_file_path.

// Read model file into buffer array
std::vector<char> buffer;
ReadFileToBuffer("./model1.onnx", buffer);
Ort::SessionOptions so;
// Enable EPContext ONNX model dumping
so.AddConfigEntry(kOrtSessionOptionEpContextEnable, "1");
// Specify the generated EPContext model file path using option ep.context_file_path
so.AddConfigEntry(kOrtSessionOptionEpContextFilePath, "./model_ctx.onnx");
// Add the execution provider (using QNN as an example)
so.AppendExecutionProvider("QNN", provider_options);
// Create the session to dump the `_ctx.onnx` model
Ort::Session session1(env, buffer.data(), buffer.size(), so);

Generate the EPContext model by creating a session from a model in a memory buffer, where the model has external weights:
Create the session from a memory array for a model that depends on external data. The session requires session.model_external_initializers_file_folder_path to locate the external data and, as in the previous example, ep.context_file_path to set the file path for the generated EPContext model.

// Read model file into buffer array
std::vector<char> buffer;
ReadFileToBuffer("./model_folder/model1.onnx", buffer);
Ort::SessionOptions so;
// Enable EPContext ONNX model dumping
so.AddConfigEntry(kOrtSessionOptionEpContextEnable, "1");
// Specify the generated EPContext model file path using option ep.context_file_path
so.AddConfigEntry(kOrtSessionOptionEpContextFilePath, "./model_folder/model_ctx.onnx");
// Specify the external data folder path using option session.model_external_initializers_file_folder_path
so.AddConfigEntry(kOrtSessionOptionsModelExternalInitializersFileFolderPath, "./external_data_folder/");
// Add the execution provider (using QNN as an example)
so.AppendExecutionProvider("QNN", provider_options);
// Create the session to dump the `_ctx.onnx` model
Ort::Session session1(env, buffer.data(), buffer.size(), so);

Note: If there is a subgraph fallback on the CPU EP that depends on external data, the generated EPContext model should not rely on the original external data file used by the base model. By default, the EPContext model embeds all external data directly into the generated ONNX file. If you need to store weights in an external file, set ep.context_model_external_initializers_file_name. This option forces all initializers to be saved in the specified external file.

Inference Workflow for EP Context Cache Models


ONNX Runtime EPs that support loading models with EPContext nodes should follow the workflow and rules below for model inference:

  • Model Identification

    • The EP should first determine whether the model contains EPContext nodes.
      • If no EPContext nodes are present, the EP follows its normal inference workflow.
      • If the model contains EPContext nodes:
        • The EP should inspect the source node attribute of all EPContext nodes to verify if any of them are intended for the current EP (i.e., the source attribute matches the key expected by the EP).
        • The EP should only partition the EPContext nodes where the source attribute matches the key required by the EP.
        • The EP loads the cached context from the matched EPContext nodes.
  • Handling External Context Binaries (embed_mode = 0)
    When the EPContext cache model is generated with embed_mode = 0, the context binary is stored as a separate file alongside the ONNX model in the same folder.

    • ONNX Runtime retrieves the relative path of the context binary file from the ep_cache_context attribute of the EPContext node.
    • For models loaded from a file path:
      • The EP should determine the folder path of the input model file and combine it with the relative path to construct the full path to the context binary file.
    • For models loaded from a memory buffer:
      • Since the EP cannot derive the model’s folder path, the user must specify the session option ep.context_file_path.
      • The EP uses ep.context_file_path to determine the folder path and combines it with the relative path to construct the full path to the context binary file.
  • Support for Multiple Primary EPContext Nodes (main_context = 1)

    • The EP should support multiple primary EPContext nodes without any limitations.
    • The EP must be capable of loading all EP context binary buffers/files specified in the ep_cache_context attributes of the EPContext nodes, deserializing them, managing the ep_graphs, and selecting the appropriate one for execution.
  • Error Handling During EP Context Binary Loading

    The EP or its backend SDK should be capable of detecting common failure scenarios (including but not limited to the following). In such cases, the EP should return a status with the INVALID_GRAPH status code:

    • Detect mismatches between the driver version and the version required by the EP context binary; return an error if they are incompatible.
    • Detect mismatches between the runtime SDK version and the version used to generate the EP context binary; return an error if they are incompatible.
    • Return an error if loading the EP context binary fails for any reason.
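The path resolution described for embed_mode = 0 can be sketched as follows, assuming a hypothetical ResolveContextBinaryPath helper built on std::filesystem; the real EPs implement this internally.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Sketch (illustrative, not an ONNX Runtime API): the relative path stored in
// the ep_cache_context attribute is combined with the folder of either the
// loaded model file or the user-provided ep.context_file_path.
fs::path ResolveContextBinaryPath(const std::string& model_or_context_file_path,
                                  const std::string& ep_cache_context_relative) {
  // Take the folder containing the model (or ep.context_file_path) ...
  fs::path folder = fs::path(model_or_context_file_path).parent_path();
  // ... and append the relative path from the EPContext node attribute.
  // Subfolders in the relative path (e.g. "sub/model_qnn.bin") are allowed.
  return folder / ep_cache_context_relative;
}
```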

EP Context nodes with different EPs

Create inference session from pre-compiled EPContext model:
Create the session from the model file path. If there is an external EP context binary file, the session can derive the binary file path from the model file path.

Ort::SessionOptions so;
// Add EP, take QNN for example
so.AppendExecutionProvider("QNN", provider_options);
// Create sessions to load from the _ctx.onnx model
Ort::Session session1(env, "model1_ctx.onnx", so);
session1.Run(...);

Create inference session from pre-compiled EPContext model in memory buffer:
Creating a session from a memory buffer of the model causes the session to lose track of the model’s name and path. To resolve this, you must set: ep.context_file_path.

  • The session uses this path to identify the folder location.
  • With the EP context binary file name from the EPContext node, the session constructs the full path to the final EP context binary file.
// Read model file into buffer array
std::vector<char> buffer;
ReadFileToBuffer("./model_folder/model_ctx.onnx", buffer);
Ort::SessionOptions so;
// Specify the EPContext model file path using option ep.context_file_path
so.AddConfigEntry(kOrtSessionOptionEpContextFilePath, "./model_folder/model_ctx.onnx");
// Add EP, take QNN for example
so.AppendExecutionProvider("QNN", provider_options);
// Create sessions to load from the buffer
Ort::Session session1(env, buffer.data(), buffer.size(), so);
session1.Run(...);

In ONNX, weight sharing refers to multiple ONNX models with external weights pointing to the same external weight file. These models use the same tensor names, allowing them to reference the same tensor data.

Weight sharing across Onnx models

Weight Sharing in EP Domain with EPContext


EP weight sharing is enabled using a pre-generated EP context binary/blob. To do this, users must generate the context binary offline (Ahead Of Time).

  • Some EPs require specific platforms, such as Linux x86_64 and/or Windows x86_64. Please refer to the specific EP page for details.
  • The EP context binary contains multiple graphs that share the same tensors.

Weight sharing in EP context binary

The EP or backend SDK should be capable of converting and compiling the graph as described above.

  • The EP or SDK should identify identical weights from the existing EP context generated by previously compiled graphs.
  • When new graphs are compiled into the EP context, they should reuse existing weights if they are recognized as identical. For example, in [model_name]_[ep].bin, tensor1_1 from ep_graph1 and tensor2_1 from ep_graph2 are identical and both point to the same data offset, tensor_data1.
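A minimal sketch of this de-duplication idea, assuming a hypothetical ContextBinaryBuilder (real backend SDKs implement their own mechanisms): identical tensor payloads are detected by content and mapped to a single data offset that multiple graphs then reference.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Illustrative sketch, not a real SDK mechanism: identical weights from
// different graphs are stored once and share a data offset in the binary.
class ContextBinaryBuilder {
 public:
  // Returns the offset of the tensor data in the binary, reusing the offset
  // of previously added identical data.
  size_t AddTensorData(const std::vector<char>& data) {
    std::string key(data.begin(), data.end());  // content key (a real SDK would hash)
    auto it = offsets_.find(key);
    if (it != offsets_.end()) return it->second;  // identical weight: reuse offset
    size_t offset = blob_.size();
    blob_.insert(blob_.end(), data.begin(), data.end());
    offsets_.emplace(std::move(key), offset);
    return offset;
  }
  size_t BlobSize() const { return blob_.size(); }

 private:
  std::vector<char> blob_;                 // the context binary payload
  std::map<std::string, size_t> offsets_;  // content -> shared offset
};
```

In the example from the text, tensor1_1 from ep_graph1 and tensor2_1 from ep_graph2 would both resolve to the same offset (tensor_data1) under this scheme.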

EPContext Model Generation with Weight Sharing Workflow


Weight sharing workflow

Each ONNX Runtime session is associated with an ONNX model. Models that share weights are grouped into a model group, while ONNX Runtime sessions with common properties are organized into a session group. ONNX Runtime introduces two session options: ep.share_ep_contexts and ep.stop_share_ep_contexts to facilitate session grouping.

  • All ONNX Runtime sessions within the session group should have ep.share_ep_contexts enabled.
  • The final ONNX Runtime session uses ep.stop_share_ep_contexts to indicate that it is the last session in the group.
    Note: A single ONNX model may contain multiple EPContext nodes, depending on the graph partitioning result. However, for simplicity, each model is shown with only one EPContext node here.

Implementation Guidelines for EPContext Model Generation with Weight Sharing

  • Shared Workspace Creation:
    The first session creates a shared workspace (e.g., EP Singleton) to share resources with other sessions.
  • EP Context Binary File Naming:
    The EP context binary file name is determined by the first session and stored in the shared workspace (e.g., EP Singleton) for use across session groups.
    The EP context binary file name should be [model1_name]_[ep].bin.
  • Graph Compilation:
    All sessions in the session group compile their graphs into the shared resource.
  • EPContext Model Generation:
    Each session in the session group creates an EPContext ONNX model. The EP generates an EPContext node that references the EP context binary file name. The ONNX Runtime framework then dumps the EPContext ONNX model.
  • Final EP Context Binary File Generation:
    The last session (the one with ep.stop_share_ep_contexts enabled) in the session group generates the final EP context binary file using the name stored in the shared workspace.
  • Shared Workspace Cleanup:
    The last session clears the shared workspace. An empty shared workspace indicates that the next session to run is the first session.
  • Number of Files Generated:
    For N source models that share weights, a total of N+1 files should be generated.
    The generated files are model1_ctx.onnx, ..., modeln_ctx.onnx, [model1_name]_[ep].bin.
Ort::SessionOptions so;
// Enable EPContext ONNX model dumping
so.AddConfigEntry(kOrtSessionOptionEpContextEnable, "1");
// Enable EP context sharing across sessions
so.AddConfigEntry(kOrtSessionOptionShareEpContexts, "1");
// Add the execution provider (using QNN as an example)
so.AppendExecutionProvider("QNN", provider_options);
// Create the first session to dump the model1_ctx.onnx file
Ort::Session session1(env, "model1.onnx", so);
// Mark the last session by enabling ep.stop_share_ep_contexts
so.AddConfigEntry(kOrtSessionOptionStopShareEpContexts, "1");
// Create the last session to dump the model2_ctx.onnx file and generate the [model1_name]_[ep].bin
Ort::Session session2(env, "model2.onnx", so);

General Tool for EPContext Model Generation with Weight Sharing


OnnxRuntime provides the ep_weight_sharing_ctx_gen tool to automate the weight-sharing workflow. The tool is specifically designed for weight-sharing scenarios and handles the entire EPContext model generation process. Example command line:

./ep_weight_sharing_ctx_gen -e qnn -i "soc_model|60 htp_graph_finalization_optimization_mode|3" ./model1.onnx,./model2.onnx

It creates two Onnx models (model1_ctx.onnx, model2_ctx.onnx) and one QNN context binary file ([model1_name]_[ep].bin).

Inference Sessions from EPContext Models with Weight Sharing


To use the dumped EPContext models with weight sharing enabled, ONNX Runtime inference sessions must have resource sharing activated. This is done by setting the session option:

ep.share_ep_contexts = 1

Implementation Guidelines for Inferencing from EPContext Models with Weight Sharing

  • Create the first OnnxRuntime inference session
    • Set session option: ep.share_ep_contexts=1.
    • Load the model1_ctx.onnx model.
    • The shared workspace is initially empty.
    • The EP loads [model1_name]_[ep].bin and deserializes the binary to retrieve all graphs (e.g., ep_graph1, ep_graph2).
    • The EPContext node in model1_ctx.onnx specifies the use of ep_graph1.
    • The session uses ep_graph1 for inference.
    • The remaining graphs (ep_graph2) are placed into the shared workspace for future sessions.
  • Create the Second ONNX Runtime Inference Session
    • Set session option: ep.share_ep_contexts=1.
    • Load the model2_ctx.onnx model.
    • The EPContext node in model2_ctx.onnx specifies the use of ep_graph2.
    • The shared workspace already contains ep_graph2.
    • The EP skips loading [model1_name]_[ep].bin since the required graph is already available in the shared workspace.
    • The session moves ep_graph2 from the shared workspace to the current session, making it no longer accessible from the shared workspace.
  • Session Cleanup Best Practices
    • To avoid issues during concurrent execution, it is recommended to destroy the sessions in reverse order (i.e., destroy the second session before the first session).
    • This ensures proper resource management and prevents potential conflicts with shared resources.
Ort::SessionOptions so;
// enable ep.share_ep_contexts
so.AddConfigEntry(kOrtSessionOptionShareEpContexts, "1");
// Add EP, take QNN for example
so.AppendExecutionProvider("QNN", provider_options);
// Create sessions to load from the _ctx.onnx models with resource sharing enabled
Ort::Session session1(env, "model1_ctx.onnx", so);
Ort::Session session2(env, "model2_ctx.onnx", so);
session1.Run(...);
session2.Run(...);