Storage Backends¶

Backends determine how and where SHAP explanations are stored and retrieved.

ParquetBackend¶

The default backend. It stores explanations in Apache Parquet format for efficient columnar storage.

Basic Usage¶

from shapmonitor.backends import ParquetBackend

# Create backend
backend = ParquetBackend("/path/to/shap_logs")

# Use with SHAPMonitor
from shapmonitor import SHAPMonitor

monitor = SHAPMonitor(
    explainer=explainer,
    backend=backend  # Or pass data_dir instead of a backend
)

Directory Structure¶

Parquet files are organized using Hive-style partitioning for efficient querying and compatibility with tools like Dask, Spark, and DuckDB:

shap_logs/
├── date=2025-12-26/
│   ├── uuid-1234.parquet
│   ├── uuid-5678.parquet
│   └── uuid-9abc.parquet
├── date=2025-12-27/
│   ├── uuid-def0.parquet
│   └── uuid-1234.parquet
└── date=2025-12-28/
    └── uuid-5678.parquet

With multiple partition keys (e.g., partition_by=["date", "model_version"]):

shap_logs/
├── date=2025-12-26/
│   ├── model_version=v1.0/
│   │   └── uuid-1234.parquet
│   └── model_version=v2.0/
│       └── uuid-5678.parquet
└── date=2025-12-27/
    └── model_version=v1.0/
        └── uuid-9abc.parquet

Each file contains one batch of explanations.
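
Because partition values are encoded in directory names, they can be recovered from a file path without opening the file. The following is a minimal sketch using only the standard library; `parse_hive_partitions` is a hypothetical helper written for illustration and is not part of shapmonitor:

```python
from pathlib import PurePosixPath


def parse_hive_partitions(path):
    """Return the key=value directory components of a path as a dict."""
    partitions = {}
    # Only directory components are inspected; the file name itself is skipped.
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


example = "shap_logs/date=2025-12-26/model_version=v1.0/uuid-1234.parquet"
print(parse_hive_partitions(example))
# {'date': '2025-12-26', 'model_version': 'v1.0'}
```

Tools such as pandas (with pyarrow), Spark, and DuckDB perform this parsing themselves when they read a partitioned directory.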

Data Schema¶

Each Parquet file contains:

Column	Type	Description
`timestamp`	datetime	When batch was logged
`batch_id`	string	Unique batch identifier (UUID)
`model_version`	string	Model version identifier
`n_samples`	int	Number of samples in batch
`base_value`	float	Expected value from explainer
`shap_{feature}`	float	SHAP value for each feature
`{feature}`	float	Original feature value
`prediction`	float	Model prediction (if provided)

Example:

import pandas as pd

# Read a Parquet file directly
df = pd.read_parquet("shap_logs/date=2025-12-26/uuid-1234.parquet")

print(df.columns)
# Index(['timestamp', 'batch_id', 'model_version', 'n_samples', 'base_value',
#        'shap_MedInc', 'shap_HouseAge', 'MedInc', 'HouseAge', 'prediction'], dtype='object')

# Read entire directory with Hive partitioning (partition columns auto-added)
df = pd.read_parquet("shap_logs/")
print(df.columns)
# Includes 'date' column from Hive partition
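
The schema table implies a predictable column layout for a given list of feature names, which can be used to sanity-check a file that was read directly. The sketch below assumes the column order shown in the example output above; `expected_columns` and `missing_columns` are hypothetical helpers, not shapmonitor functions:

```python
def expected_columns(features, with_prediction=True):
    """List the columns a batch file should contain for the given features."""
    columns = ["timestamp", "batch_id", "model_version", "n_samples", "base_value"]
    columns += [f"shap_{name}" for name in features]  # one SHAP column per feature
    columns += list(features)                         # original feature values
    if with_prediction:
        columns.append("prediction")                  # only present if provided
    return columns


def missing_columns(actual, features, with_prediction=True):
    """Return the expected columns that are absent from `actual`."""
    present = set(actual)
    return [c for c in expected_columns(features, with_prediction) if c not in present]


print(expected_columns(["MedInc", "HouseAge"]))
print(missing_columns(["timestamp", "batch_id"], ["MedInc"], with_prediction=False))
```

With a DataFrame loaded from a batch file, `missing_columns(df.columns, features)` returns an empty list when every expected column is present.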

Configuration¶

backend = ParquetBackend(file_dir="/path/to/logs")

# With custom partitioning (default is ["date"])
backend = ParquetBackend(
    file_dir="/path/to/logs",
    partition_by=["date", "model_version"]  # Hive-style: date=.../model_version=.../
)

# Supported partition keys: "date", "batch_id", "model_version"

# Access configuration
backend.file_dir  # Path object
backend.partition_by  # List of partition keys
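
To make the effect of `partition_by` concrete, the sketch below builds the directory a batch would land in for a given key order. It is an illustration of the Hive-style layout described earlier, not the backend's actual code; `partition_dir` and `SUPPORTED_KEYS` are hypothetical names, and the `YYYY-MM-DD` date format is taken from the directory listings above:

```python
from datetime import datetime
from pathlib import Path

SUPPORTED_KEYS = ("date", "batch_id", "model_version")


def partition_dir(file_dir, partition_by, timestamp, batch_id, model_version):
    """Build the Hive-style directory for one batch (illustrative only)."""
    values = {
        "date": timestamp.strftime("%Y-%m-%d"),
        "batch_id": batch_id,
        "model_version": model_version,
    }
    path = Path(file_dir)
    for key in partition_by:
        if key not in SUPPORTED_KEYS:
            raise ValueError(f"Unsupported partition key: {key!r}")
        path = path / f"{key}={values[key]}"  # keys nest in the order given
    return path


target = partition_dir(
    "/path/to/logs",
    ["date", "model_version"],
    datetime(2025, 12, 26, 10, 30),
    "uuid-1234",
    "v1.0",
)
print(target.as_posix())
# /path/to/logs/date=2025-12-26/model_version=v1.0
```

The order of keys in `partition_by` determines the nesting order of the directories.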

Backend Operations¶

Write¶

Write a batch of explanations:

from shapmonitor.types import ExplanationBatch
from datetime import datetime

# Create batch (normally done by SHAPMonitor)
batch = ExplanationBatch(
    timestamp=datetime.now(),
    batch_id="uuid-1234",
    model_version="v1.0",
    n_samples=100,
    base_values=[0.5] * 100,
    shap_values={"feature_1": [0.1] * 100},     # One list of n_samples values per feature
    feature_values={"feature_1": [1.0] * 100},  # One list of n_samples values per feature
    predictions=[0.6] * 100,
)

# Write to backend
path = backend.write(batch)
print(f"Wrote to {path}")

Read¶

Read explanations from a date range with optional filtering:

from datetime import datetime, timedelta

# Read single day
today = datetime.now()
df = backend.read(today, today)

# Read date range
week_ago = today - timedelta(days=7)
df = backend.read(week_ago, today)

# Filter by model version (partition pruning applies when model_version is a partition key)
df = backend.read(week_ago, today, model_version="v1.0")

# Filter by batch_id
df = backend.read(today, batch_id="uuid-1234")

print(f"Read {len(df)} samples")

Parameters:

start_dt: Start date (inclusive)
end_dt: End date (inclusive), defaults to start_dt
batch_id: Optional batch ID filter
model_version: Optional model version filter

Returns:

DataFrame with all samples matching the filters
Empty DataFrame if no data found
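
Since both ends of the range are inclusive and `end_dt` defaults to `start_dt`, a read touches one date partition per calendar day in the range. The following sketch enumerates those partition names; `date_partitions` is a hypothetical helper, and ignoring the time-of-day component is an assumption made for this illustration:

```python
from datetime import datetime, timedelta


def date_partitions(start_dt, end_dt=None):
    """List the date=YYYY-MM-DD partitions covered by an inclusive range."""
    end_dt = start_dt if end_dt is None else end_dt
    day, last = start_dt.date(), end_dt.date()
    names = []
    while day <= last:
        names.append(f"date={day.isoformat()}")
        day += timedelta(days=1)
    return names


print(date_partitions(datetime(2025, 12, 26), datetime(2025, 12, 28)))
# ['date=2025-12-26', 'date=2025-12-27', 'date=2025-12-28']
print(date_partitions(datetime(2025, 12, 26, 14, 30)))
# ['date=2025-12-26']
```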

Delete¶

Delete old data to manage storage:

from datetime import datetime, timedelta

# Delete data older than 30 days
cutoff = datetime.now() - timedelta(days=30)
deleted_count = backend.delete(cutoff)

print(f"Deleted {deleted_count} partitions")

Parameters:

cutoff_dt: Delete data before this date

Returns:

Number of date partitions deleted
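
Before deleting, it can be useful to preview which partitions fall before a cutoff. The sketch below is a dry run that only lists directories and removes nothing. It assumes `date` is the top-level partition key and that "before" means strictly earlier than the cutoff's calendar date; `partitions_before` is a hypothetical helper, not part of the backend:

```python
import tempfile
from datetime import datetime
from pathlib import Path


def partitions_before(file_dir, cutoff_dt):
    """Return date=YYYY-MM-DD directories dated strictly before cutoff_dt."""
    stale = []
    for path in sorted(Path(file_dir).glob("date=*")):
        if not path.is_dir():
            continue
        try:
            day = datetime.strptime(path.name[len("date="):], "%Y-%m-%d").date()
        except ValueError:
            continue  # skip directories whose name is not a valid date
        if day < cutoff_dt.date():
            stale.append(path)
    return stale


# Demonstration on a throwaway directory
with tempfile.TemporaryDirectory() as root:
    for name in ("date=2025-12-26", "date=2025-12-27", "date=2025-12-28"):
        (Path(root) / name).mkdir()
    stale_names = [p.name for p in partitions_before(root, datetime(2025, 12, 28))]

print(stale_names)
# ['date=2025-12-26', 'date=2025-12-27']
```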

Parquet Benefits¶

Storage Efficiency¶

Columnar format: Readers can load only the columns they need
Compression: Column data is compressed with codecs such as snappy or gzip
Type optimization: Values are stored in typed binary encodings rather than as text

Compression ratios depend on the data; files are typically 5-10x smaller than the equivalent CSV.

Query Performance¶

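import pandas as pd
from datetime import datetime

# Fast: Only reads required dates
df = backend.read(
    datetime(2025, 12, 26),
    datetime(2025, 12, 27)
)  # Only reads 2 days

# Selecting columns from df is an in-memory operation; the data is already loaded
shap_values = df[['shap_feature_1', 'shap_feature_2']]

# Efficient: Columnar access when reading files directly
shap_values = pd.read_parquet(
    "shap_logs/", columns=['shap_feature_1', 'shap_feature_2']
)  # Only reads 2 columns from disk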

Compatibility¶

Readable by pandas, pyarrow, DuckDB, Spark
Standard format for data science workflows
Easy integration with data lakes

Storage Management¶

Monitor Storage Size¶

# Check storage usage
du -sh /path/to/shap_logs

# List partitions
ls -lh /path/to/shap_logs/
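
The same check can be done from Python, which is convenient inside a monitoring job. This is a minimal standard-library sketch that sums Parquet file sizes per top-level partition; `partition_sizes` is a hypothetical helper, and the demonstration writes placeholder bytes rather than real Parquet data:

```python
import tempfile
from pathlib import Path


def partition_sizes(file_dir):
    """Map each top-level partition directory to its total .parquet bytes."""
    sizes = {}
    for partition in sorted(Path(file_dir).iterdir()):
        if partition.is_dir():
            sizes[partition.name] = sum(
                f.stat().st_size for f in partition.rglob("*.parquet") if f.is_file()
            )
    return sizes


# Demonstration on a throwaway directory with placeholder files
with tempfile.TemporaryDirectory() as root:
    day = Path(root) / "date=2025-12-26"
    day.mkdir()
    (day / "uuid-1234.parquet").write_bytes(b"x" * 2048)
    (day / "uuid-5678.parquet").write_bytes(b"x" * 1024)
    sizes = partition_sizes(root)

for name, size in sizes.items():
    print(f"{name}: {size / 1024:.1f} KiB")
# date=2025-12-26: 3.0 KiB
```

In practice the function would be called as `partition_sizes(backend.file_dir)`.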

Retention Policy¶

Implement a retention policy to manage storage:

from datetime import datetime, timedelta

def cleanup_old_data(backend, retention_days=30):
    """Delete data older than retention_days."""
    cutoff = datetime.now() - timedelta(days=retention_days)
    deleted = backend.delete(cutoff)
    print(f"Deleted {deleted} partitions older than {retention_days} days")

# Run periodically (e.g., daily cron job)
cleanup_old_data(backend, retention_days=90)

Archival Strategy¶

For long-term storage:

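import shutil
from datetime import datetime, timedelta

# Archive data older than 1 year to cold storage
cutoff = datetime.now() - timedelta(days=365)
archive_dir = "/archive/shap_logs"

def is_newer_than(name, cutoff):
    """Return True for date=YYYY-MM-DD directories dated on or after the cutoff."""
    if not name.startswith("date="):
        return False  # Not a date partition: always copy
    try:
        day = datetime.strptime(name[len("date="):], "%Y-%m-%d").date()
    except ValueError:
        return False
    return day >= cutoff.date()

# Copy then delete
shutil.copytree(
    "/path/to/shap_logs",
    archive_dir,
    ignore=lambda dir, files: [f for f in files if is_newer_than(f, cutoff)],
    dirs_exist_ok=True,  # Allow repeated runs into the same archive (Python 3.8+)
)
backend.delete(cutoff)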

Custom Backends¶

The backend interface is pluggable. Future versions may support:

S3/Cloud storage
Database backends (PostgreSQL, DuckDB)
Time-series databases

To implement a custom backend, extend BaseBackend:

from shapmonitor.backends._base import BaseBackend

class CustomBackend(BaseBackend):
    def read(self, start_dt, end_dt=None, batch_id=None, model_version=None):
        # Implement read logic
        pass

    def write(self, batch):
        # Implement write logic
        pass

    def delete(self, cutoff_dt):
        # Implement delete logic
        pass
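
To show how the three methods fit together, here is a self-contained in-memory sketch. It is duck-typed rather than a subclass of `BaseBackend` so that it runs without shapmonitor installed. `Batch` is a hypothetical stand-in for `ExplanationBatch` with only the fields used here, `read` returns a list instead of a DataFrame, and `delete` counts batches rather than partitions; all of these are simplifications for illustration:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Batch:
    """Hypothetical stand-in for ExplanationBatch."""
    timestamp: datetime
    batch_id: str
    model_version: str


class InMemoryBackend:
    """Illustrative backend that keeps batches in a list."""

    def __init__(self):
        self._batches = []

    def write(self, batch):
        self._batches.append(batch)
        return batch.batch_id

    def read(self, start_dt, end_dt=None, batch_id=None, model_version=None):
        end_dt = start_dt if end_dt is None else end_dt  # inclusive range
        return [
            b for b in self._batches
            if start_dt.date() <= b.timestamp.date() <= end_dt.date()
            and (batch_id is None or b.batch_id == batch_id)
            and (model_version is None or b.model_version == model_version)
        ]

    def delete(self, cutoff_dt):
        before = len(self._batches)
        self._batches = [
            b for b in self._batches if b.timestamp.date() >= cutoff_dt.date()
        ]
        return before - len(self._batches)


store = InMemoryBackend()
store.write(Batch(datetime(2025, 12, 26, 9), "uuid-1234", "v1.0"))
store.write(Batch(datetime(2025, 12, 27, 9), "uuid-5678", "v2.0"))
store.write(Batch(datetime(2025, 12, 28, 9), "uuid-9abc", "v1.0"))

in_range = store.read(datetime(2025, 12, 26), datetime(2025, 12, 27))
v1_only = store.read(datetime(2025, 12, 26), datetime(2025, 12, 28), model_version="v1.0")
removed = store.delete(datetime(2025, 12, 27))
print(len(in_range), len(v1_only), removed)
# 2 2 1
```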

Performance Tips¶

Batch Size¶

Larger batches → fewer files → better read performance
Smaller batches → more granular timestamps
Recommended: 100-1000 samples per batch
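
Because each file holds one batch, the number of files written per day follows directly from traffic and batch size. A quick back-of-the-envelope sketch, where the figure of 50,000 samples per day is an arbitrary example value:

```python
import math


def files_per_day(samples_per_day, batch_size):
    """Number of batch files written per day, assuming one file per batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(samples_per_day / batch_size)


for batch_size in (10, 100, 1000):
    print(batch_size, files_per_day(50_000, batch_size))
# 10 5000
# 100 500
# 1000 50
```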

Date Partitioning¶

Daily partitions work well for most use cases
Automatic in ParquetBackend
Enables efficient date range queries

Compression¶

Parquet files are compressed by default (pandas uses snappy). For custom compression:

# When writing directly to Parquet
df.to_parquet(
    path,
    compression='snappy',  # Fast compression
    # compression='gzip',   # Better compression ratio
    index=False
)

Concurrent Access¶

ParquetBackend is thread-safe for:

Multiple writers (different batch IDs)
Concurrent reads
Read while writing
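
One common way to get these properties with plain files is to give every batch its own uniquely named file and to publish it with an atomic rename. The sketch below demonstrates that general pattern with the standard library. It is not a description of how ParquetBackend is implemented, `write_batch` is a hypothetical helper, and the payload is placeholder bytes rather than real Parquet data:

```python
import tempfile
import uuid
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_batch(partition_dir, payload):
    """Write payload to a new UUID-named file and return its final path."""
    path = Path(partition_dir) / f"{uuid.uuid4()}.parquet"
    tmp = path.with_suffix(".tmp")
    tmp.write_bytes(payload)  # readers globbing *.parquet never see this file
    tmp.replace(path)         # rename into place once the write is complete
    return path


with tempfile.TemporaryDirectory() as root:
    partition = Path(root) / "date=2025-12-26"
    partition.mkdir()
    with ThreadPoolExecutor(max_workers=8) as pool:
        paths = list(
            pool.map(lambda i: write_batch(partition, b"batch %d" % i), range(32))
        )
    written = sorted(p.name for p in partition.glob("*.parquet"))
    leftovers = list(partition.glob("*.tmp"))

print(len(written), len(leftovers))
# 32 0
```

Since no two writers share a file name, they never contend for the same file, and a reader that lists `*.parquet` only sees completed files.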

Troubleshooting¶

File Not Found¶

# Check if directory exists
if backend.file_dir.exists():
    print("Backend directory exists")
else:
    print("Backend directory not found")

Empty Results¶

# Verify data exists for date range
df = backend.read(start_date, end_date)

if df.empty:
    print("No data in date range")
    # Check what dates have data (Hive-style directories)
    for date_dir in sorted(backend.file_dir.glob("date=*")):
        if date_dir.is_dir():
            print(f"Data available: {date_dir.name}")

Corrupt Files¶

# Try to read every file in a partition (Hive-style) and report failures
import pandas as pd

partition = backend.file_dir / "date=2025-12-26"
for file in partition.rglob("*.parquet"):
    try:
        df = pd.read_parquet(file)
        print(f"✓ {file.name}: {len(df)} rows")
    except Exception as e:
        print(f"✗ {file.name}: {e}")
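
A cheaper first check for truncated files avoids loading any data. The Parquet format places the 4-byte magic number `PAR1` at both the start and the end of every file, so a file that was cut off mid-write usually fails this test. The sketch below uses only the standard library; `has_parquet_magic` is a hypothetical helper, and passing the check does not prove the file is otherwise valid:

```python
import os
import tempfile
from pathlib import Path

PARQUET_MAGIC = b"PAR1"


def has_parquet_magic(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    path = Path(path)
    # Leading magic + 4-byte footer length + trailing magic is the bare minimum.
    if path.stat().st_size < 12:
        return False
    with path.open("rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


# Demonstration with placeholder files (not real Parquet content)
with tempfile.TemporaryDirectory() as root:
    complete = Path(root) / "complete.parquet"
    truncated = Path(root) / "truncated.parquet"
    complete.write_bytes(b"PAR1" + b"\x00" * 16 + b"PAR1")
    truncated.write_bytes(b"PAR1" + b"\x00" * 16)
    results = {p.name: has_parquet_magic(p) for p in (complete, truncated)}

print(results)
# {'complete.parquet': True, 'truncated.parquet': False}
```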