Model Configuration | PetroAI Documentation

# Model Configuration

The model_configs block defines a flexible setup for training one or more machine learning models. Each model entry is uniquely named and includes customizable filters and feature selections. Below is an example configuration:

```json
{
  "model": {
    "model_configs": {
      "WCA": {
        "training_filter": "interval == 'WCA' & lateralLength <= 7500 & totalProppantByPerfLength >=1500 & totalFluidByPerfLength >= 1000 & totalDrainage >0 & prodOil12mo >= 10000 & prodOil12mo <= 800000",
        "evaluation_filter": "interval == 'WCA' & lateralLength <= 7500 & totalProppantByPerfLength >=1500 & totalFluidByPerfLength >= 1000 & totalDrainage >0",
        "model_features": [
          "totalDrainage",
          "lateralLength",
          "totalProppantByPerfLength",
          "totalFluidByPerfLength",
          "wellExtras_GOR_12mo",
          "attributes_Isopach",
          "attributes_Porosity"
        ]
      },
      "WCB": {
        "training_filter": "interval == 'WCB' & lateralLength <= 7500 & totalProppantByPerfLength >=1500 & totalFluidByPerfLength >= 1000 & totalDrainage >0 & prodOil12mo >= 10000 & prodOil12mo <= 800000",
        "evaluation_filter": "interval == 'WCB' & lateralLength <= 7500 & totalProppantByPerfLength >=1500 & totalFluidByPerfLength >= 1000 & totalDrainage >0",
        "model_features": [
          "totalDrainage",
          "lateralLength",
          "totalProppantByPerfLength",
          "totalFluidByPerfLength",
          "wellExtras_GOR_12mo",
          "attributes_Isopach",
          "attributes_Porosity"
        ]
      }
    }
  }
}
```

# Structure

Each model configuration includes the following components:

- Model Name: A text string that uniquely identifies each model. The model_configs block can contain one or more named model sections. In the example above, "WCA" and "WCB" are the model names.
- training_filter: A Boolean expression that defines which wells are included during model training. This can include:
  - Target intervals or zones
  - Inclusion or exclusion of specific well IDs
  - Minimum/maximum thresholds for attributes (e.g., completion year, lateral length, proppant/ft, 12-month oil production)
- evaluation_filter: A separate but typically similar filter applied during model evaluation. Evaluations include predictions for PDP, GRID, and INV wells.
- model_features: A list of input features used for training the model. These can include:
  - Core PetroAI columns referenced in the Well Features Glossary
    - Engineered drainage features
    - Completions features
    - Parent/child/sibling well metrics
    - Spatial or orientation features (e.g., lat/lon, SHmax, orientation, depth)
  - Geological or petrophysical attributes. These columns are prefixed with "attributes_".
  - Extra columns (i.e., non-core to PetroAI). These columns are prefixed with "wellExtras_".
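
The structure described above can be checked before a build is submitted. The following is a minimal sketch, not part of PetroAI: the function name `validate_model_configs` and the decision to treat all three keys as required are assumptions made for illustration.

```python
import json

# Keys that the Structure section lists for every model entry.
# Treating them all as mandatory is an assumption of this sketch.
REQUIRED_KEYS = ("training_filter", "evaluation_filter", "model_features")


def validate_model_configs(config: dict) -> list:
    """Return a list of problems found under config["model"]["model_configs"]."""
    problems = []
    model_configs = config.get("model", {}).get("model_configs", {})
    if not model_configs:
        problems.append("no models defined under model.model_configs")
    for name, entry in model_configs.items():
        for key in REQUIRED_KEYS:
            if key not in entry:
                problems.append(f"{name}: missing '{key}'")
        features = entry.get("model_features", [])
        if not isinstance(features, list) or not features:
            problems.append(f"{name}: model_features must be a non-empty list")
    return problems


# Shortened, hypothetical configuration: "WCB" is deliberately incomplete.
example = json.loads("""
{
  "model": {
    "model_configs": {
      "WCA": {
        "training_filter": "interval == 'WCA' & prodOil12mo >= 10000",
        "evaluation_filter": "interval == 'WCA'",
        "model_features": ["totalDrainage", "lateralLength"]
      },
      "WCB": {
        "training_filter": "interval == 'WCB' & prodOil12mo >= 10000"
      }
    }
  }
}
""")

for problem in validate_model_configs(example):
    print(problem)
```

Running the sketch reports the three problems in the "WCB" entry and nothing for "WCA".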

# Notes

- Any column in the CORE_well_features table can be used in model features and filters, but column names must exactly match the spelling and casing used in the dataset.
- Multiple models can be defined under model_configs, each with its own custom filters and feature set.
- Filters follow a Python-style Boolean syntax and use logical and comparison operators (&, |, in, not in, ==, >, >=, <, <=, etc.).
- There is no limit to the number of models or features, allowing full flexibility for comparative training, interval-specific models, scenario testing, etc.
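
Because column names are case-sensitive, a misspelled name in a filter is easy to miss. One way to catch it early is to parse the filter with Python's standard `ast` module and compare the referenced names against a known column list. This is only a sketch: it assumes column names are valid Python identifiers (a name containing a hyphen would be split into two names), the `known` set is a hypothetical subset of CORE_well_features, and it does not reflect how PetroAI itself parses filters.

```python
import ast


def filter_columns(expression: str) -> set:
    """Collect the bare names referenced in a Python-style filter string."""
    tree = ast.parse(expression, mode="eval")
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}


def unknown_columns(expression: str, known_columns) -> set:
    """Names used in the filter that do not exactly match a known column."""
    return filter_columns(expression) - set(known_columns)


# Hypothetical subset of column names, for illustration only.
known = {"interval", "lateralLength", "totalDrainage", "prodOil12mo"}

# "laterallength" has the wrong casing and should be reported.
expr = "interval == 'WCA' & laterallength <= 7500 & totalDrainage > 0"
print(unknown_columns(expr, known))
```

The expression is only parsed, never evaluated, so the check needs no well data.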

# Training Filters

# Suggested Usage

The training_filter setting combines one or more filter conditions to select the subset of data used to train the machine learning models. Use these filters to remove erroneous data from the training set. Suggested training filters include the following limits:

- interval in ['A', 'B', ...] - Limit the training data to intervals of interest that have complete feature mapping (i.e., do not include intervals without geo grids if you want to compare them against intervals that are fully geo-attributed).
- completionYear >= 2014 - Limit the training data to wells completed with recent, similar operational strategies. This mitigates noise from older operational practices and well designs that may not be captured in the underlying well data.
- lateralLength > 3000 & lateralLength <= 20000 - Limit the training data to wells with lateral lengths representative of the intended strategy for future development. Also eliminates erroneously reported lateral lengths. lateralLength is an important feature and filter because it is integral to the training target in the model.
- totalProppantByPerfLength >= 1000 & totalProppantByPerfLength <= 3750 - Limit the training data to wells with proppant intensities representative of the intended strategy for future development. Also eliminates erroneously reported lateral lengths and proppant volumes.
- totalFluidByPerfLength >= 500 & totalFluidByPerfLength <= 4500 - Limit the training data to wells with fluid intensities representative of the intended strategy for future development. Also eliminates erroneously reported lateral lengths and fluid volumes.
- prodOil12mo >= 1000 & prodOil12mo <= 1000000 - Limit the training data to wells with a minimum amount of production, and eliminate erroneously high or low production data.
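
The suggested limits can be kept as separate clauses and joined into the single string that training_filter expects. The sketch below is illustrative only: `build_filter` is a hypothetical helper, and the interval list `['A', 'B']` is a placeholder for your own intervals of interest.

```python
def build_filter(clauses) -> str:
    """Join individual filter clauses into one '&'-separated expression."""
    cleaned = [clause.strip() for clause in clauses if clause.strip()]
    return " & ".join(cleaned)


# One clause per suggested limit; thresholds are copied from the list above.
training_clauses = [
    "interval in ['A', 'B']",  # placeholder interval names
    "completionYear >= 2014",
    "lateralLength > 3000 & lateralLength <= 20000",
    "totalProppantByPerfLength >= 1000 & totalProppantByPerfLength <= 3750",
    "totalFluidByPerfLength >= 500 & totalFluidByPerfLength <= 4500",
    "prodOil12mo >= 1000 & prodOil12mo <= 1000000",
]

training_filter = build_filter(training_clauses)
print(training_filter)
```

Keeping one clause per line makes it easier to review or adjust a single threshold without editing a long string by hand.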

# Evaluation Filters

# Suggested Usage

Each model includes an evaluation_filter, which defines the subset of wells used when applying the trained model. This ensures predictions are only made on wells that match the quality and domain constraints used during training. The evaluation_filter safeguards model integrity by:

- Preventing invalid predictions on outliers or poor-quality data.
- Aligning training and inference domains.
- Promoting robust predictions across real, proposed, and virtual wells.

The evaluation_filter is applied in the following compute pipeline phases:

- core3: PDP (Producing) Wells
- grid: Grid Wells (Virtual Locations)
- inv: Inventory Wells
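
In the example configuration, each evaluation_filter is the training_filter without the prodOil12mo clauses. A quick way to confirm that kind of alignment is to list any evaluation clause that does not also appear in the training filter. The sketch below is a rough text comparison, not PetroAI functionality: it assumes a flat filter joined only with `&` (no parentheses or `|`), and it strips all whitespace before comparing, so quoted values that contain spaces would need more careful handling.

```python
def split_clauses(expression: str) -> list:
    """Split an '&'-joined filter into clauses with all whitespace removed."""
    return ["".join(part.split()) for part in expression.split("&") if part.strip()]


def evaluation_only_clauses(training_filter: str, evaluation_filter: str) -> list:
    """Evaluation clauses that have no identical clause in the training filter."""
    training = set(split_clauses(training_filter))
    return [c for c in split_clauses(evaluation_filter) if c not in training]


# Filters for the "WCA" model, copied from the example configuration.
training_filter = (
    "interval == 'WCA' & lateralLength <= 7500 & totalProppantByPerfLength >=1500 "
    "& totalFluidByPerfLength >= 1000 & totalDrainage >0 "
    "& prodOil12mo >= 10000 & prodOil12mo <= 800000"
)
evaluation_filter = (
    "interval == 'WCA' & lateralLength <= 7500 & totalProppantByPerfLength >=1500 "
    "& totalFluidByPerfLength >= 1000 & totalDrainage >0"
)

# An empty list means every evaluation clause is also a training clause.
print(evaluation_only_clauses(training_filter, evaluation_filter))

# A hypothetical, looser evaluation filter is flagged.
print(evaluation_only_clauses(training_filter, "interval == 'WCA' & lateralLength <= 10000"))
```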

# Model Features

# Suggested Usage

Provide a list of column names from the CORE_well_features table to be used for model training. The list below offers general guidance on recommended features and rationale. Avoid highly co-dependent variables and favor features grounded in first-order physical principles rather than interpretive or derivative values to help with model explainability. If rock property attributes are not available, the well coordinates may be used as a substitute.

If using features from the WellExtra table, prefix column names with wellExtras_ (e.g., wellExtras_proppCon_lb-gal). Features beginning with attributes_ reflect values sampled from geo attribute grids.

Be mindful not to include too many features relative to the size of your training set (see Best Practices below).

The following features are suggested for constructing high-quality models that incorporate key production drivers from both subsurface petrophysics and completion engineering parameters.

- lateralLength – Recommended for capturing the impact of production efficiency across lateral lengths (e.g., a 10,000 ft lateral is not expected to perform the same per foot as a 5,000 ft lateral).
- totalDrainage – Recommended for representing well spacing effects and drainage from offset producers.
- totalProppantByPerfLength – Recommended for quantifying the impact of completion design.
- totalFluidByPerfLength – Recommended for quantifying the impact of completion design.
- attributes_*Saturation – Use a petrophysical feature to represent fluid saturation.
- attributes_*FluidAPI – Use a feature that quantifies the thermal maturity or quality of produced fluids.
- attributes_*BulkVolume – Use a feature that reflects the volume of producible fluid or rock transmissibility.
- attributes_*GeoMechanics – Use a feature that captures geomechanical rock properties.

# Best Practices

To ensure models are performant, interpretable, and ready for production, follow these best practices:

# Filters and Data Integrity

- Include a date-based filter (e.g., completionYear >= 2010) to reflect modern completion practices.
- Align training and evaluation filters to maintain consistency and avoid data leakage.
- Filter out outliers, bad data, or non-standard wells.
- Ensure the training data covers the range of values expected during prediction to avoid extrapolation.
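
The last point, avoiding extrapolation, can be spot-checked by comparing a candidate well's feature values against the minimum and maximum seen in training. This is a minimal sketch: the function name `out_of_range_features` is hypothetical and the well values are made up for illustration.

```python
def out_of_range_features(training_rows, candidate, features) -> dict:
    """Map each feature whose candidate value is outside the training range
    to the (min, max) observed in the training rows."""
    flagged = {}
    for feature in features:
        values = [row[feature] for row in training_rows]
        low, high = min(values), max(values)
        if not low <= candidate[feature] <= high:
            flagged[feature] = (low, high)
    return flagged


# Made-up training wells and one made-up inventory well.
training_rows = [
    {"lateralLength": 5000, "totalProppantByPerfLength": 1500},
    {"lateralLength": 7500, "totalProppantByPerfLength": 2000},
    {"lateralLength": 10000, "totalProppantByPerfLength": 2500},
]
candidate = {"lateralLength": 12500, "totalProppantByPerfLength": 2000}
features = ["lateralLength", "totalProppantByPerfLength"]

# lateralLength is longer than any training well, so it is flagged.
print(out_of_range_features(training_rows, candidate, features))
```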

# Feature Selection

- Include both geological and engineering features to capture the full range of production drivers.
- Add any features you want to run sensitivity or SHAP analysis on (e.g., proppant, fluid, parent production).
- Avoid strongly correlated or redundant features that may confuse model learning.
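
One way to screen for strongly correlated features is to compute pairwise Pearson correlations and flag pairs above a cutoff. The sketch below uses only the standard library. The 0.9 threshold is an arbitrary assumption, not a PetroAI recommendation, the sample values are made up, and a constant column would cause a division by zero.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns: dict, threshold: float = 0.9) -> list:
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    pairs = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Made-up values: fluid and proppant intensity move together exactly.
columns = {
    "totalProppantByPerfLength": [1500, 2000, 2500, 3000],
    "totalFluidByPerfLength": [1000, 1500, 2000, 2500],
    "lateralLength": [7500, 5000, 10000, 6000],
}

print(correlated_pairs(columns))
```

A flagged pair is a prompt to review the features, not an automatic reason to drop one.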

# Model Design

- Model training groups should be mutually exclusive to avoid overlap and confusion during prediction. Each well should clearly belong to only one model's filter criteria.
- Consider training separate models by interval fluid phase to increase precision (e.g., oil, wet gas, dry gas). Separate models should group wells by factors that represent fundamental differences in the physics driving well performance.
- Use a sufficient sample size per model to avoid overfitting.
- Balance feature count to dataset size: a good rule of thumb is to have at least 20–30 wells per feature to reduce overfitting and improve model stability. More wells with fewer features is also fine.
- Balance model complexity with interpretability depending on your use case.
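
The wells-per-feature rule of thumb is simple arithmetic and can be checked before training. In this sketch the helper names are hypothetical and the well count of 150 is a made-up example. The feature list is the one used for "WCA" in the example configuration.

```python
def max_feature_count(n_wells: int, wells_per_feature: int = 20) -> int:
    """Largest feature count that keeps at least wells_per_feature wells each."""
    return n_wells // wells_per_feature


def check_feature_budget(n_wells: int, features, wells_per_feature: int = 20):
    """Return (ok, ratio), where ratio is wells per feature."""
    ratio = n_wells / len(features)
    return ratio >= wells_per_feature, ratio


features = [
    "totalDrainage",
    "lateralLength",
    "totalProppantByPerfLength",
    "totalFluidByPerfLength",
    "wellExtras_GOR_12mo",
    "attributes_Isopach",
    "attributes_Porosity",
]

n_wells = 150  # made-up number of wells passing the training filter

# Check against both ends of the 20-30 wells-per-feature range.
for target in (20, 30):
    ok, ratio = check_feature_budget(n_wells, features, target)
    print(f"target {target}: {ratio:.1f} wells per feature, ok={ok}, "
          f"max features={max_feature_count(n_wells, target)}")
```

With seven features, 150 wells satisfy the lower bound of 20 wells per feature but not the upper bound of 30, which would need 210 wells.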

# Prediction Readiness

- Validate predictions on PDP wells before applying to inventory or grid wells.
- Review evaluation_filter after prediction if wells are unexpectedly dropping out.
- Perform sanity checks using the DIAG Insights dashboard for each model.

# Versioning

- Document the intent and rationale in the Build Notes for each model run.
- Be sure to reference the desired feature build version (e.g., 3.2.X) using the From Build selector when training new models to ensure consistent feature engineering across training and prediction.
- When making inventory or grid predictions, reference the correct model version using the From Build selector to ensure consistency with the intended training configuration.
