Storage Backends¶
Backends determine how and where SHAP explanations are stored and retrieved.
ParquetBackend¶
The default backend using Apache Parquet format for efficient columnar storage.
Basic Usage¶
from shapmonitor.backends import ParquetBackend
# Create backend
backend = ParquetBackend("/path/to/shap_logs")
# Use with SHAPMonitor
from shapmonitor import SHAPMonitor
monitor = SHAPMonitor(
explainer=explainer,
backend=backend # Or use data_dir instead
)
Directory Structure¶
Parquet files are organized using Hive-style partitioning for efficient querying and compatibility with tools like Dask, Spark, and DuckDB:
shap_logs/
├── date=2025-12-26/
│ ├── uuid-1234.parquet
│ ├── uuid-5678.parquet
│ └── uuid-9abc.parquet
├── date=2025-12-27/
│ ├── uuid-def0.parquet
│ └── uuid-1234.parquet
└── date=2025-12-28/
└── uuid-5678.parquet
With multiple partition keys (e.g., partition_by=["date", "model_version"]):
shap_logs/
├── date=2025-12-26/
│ ├── model_version=v1.0/
│ │ └── uuid-1234.parquet
│ └── model_version=v2.0/
│ └── uuid-5678.parquet
└── date=2025-12-27/
└── model_version=v1.0/
└── uuid-9abc.parquet
Each file contains one batch of explanations.
Data Schema¶
Each Parquet file contains:
| Column | Type | Description |
|---|---|---|
timestamp |
datetime | When batch was logged |
batch_id |
string | Unique batch identifier (UUID) |
model_version |
string | Model version identifier |
n_samples |
int | Number of samples in batch |
base_value |
float | Expected value from explainer |
shap_{feature} |
float | SHAP value for each feature |
{feature} |
float | Original feature value |
prediction |
float | Model prediction (if provided) |
Example:
import pandas as pd
# Read a Parquet file directly
df = pd.read_parquet("shap_logs/date=2025-12-26/uuid-1234.parquet")
print(df.columns)
# Index(['timestamp', 'batch_id', 'model_version', 'n_samples', 'base_value',
# 'shap_MedInc', 'shap_HouseAge', 'MedInc', 'HouseAge', 'prediction'], dtype='object')
# Read entire directory with Hive partitioning (partition columns auto-added)
df = pd.read_parquet("shap_logs/")
print(df.columns)
# Includes 'date' column from Hive partition
Configuration¶
backend = ParquetBackend(file_dir="/path/to/logs")
# With custom partitioning (default is ["date"])
backend = ParquetBackend(
file_dir="/path/to/logs",
partition_by=["date", "model_version"] # Hive-style: date=.../model_version=.../
)
# Supported partition keys: "date", "batch_id", "model_version"
# Access configuration
backend.file_dir # Path object
backend.partition_by # List of partition keys
Backend Operations¶
Write¶
Write a batch of explanations:
from shapmonitor.types import ExplanationBatch
from datetime import datetime
# Create batch (normally done by SHAPMonitor)
batch = ExplanationBatch(
timestamp=datetime.now(),
batch_id="uuid-1234",
model_version="v1.0",
n_samples=100,
base_values=[0.5] * 100,
shap_values={"feature_1": [0.1, 0.2, ...], ...},
feature_values={"feature_1": [1.0, 2.0, ...], ...},
predictions=[0.6, 0.7, ...]
)
# Write to backend
path = backend.write(batch)
print(f"Wrote to {path}")
Read¶
Read explanations from a date range with optional filtering:
from datetime import datetime, timedelta
# Read single day
today = datetime.now()
df = backend.read(today, today)
# Read date range
week_ago = today - timedelta(days=7)
df = backend.read(week_ago, today)
# Filter by model version (uses partition pruning for efficiency)
df = backend.read(week_ago, today, model_version="v1.0")
# Filter by batch_id
df = backend.read(today, batch_id="uuid-1234")
print(f"Read {len(df)} samples")
Parameters:
start_dt: Start date (inclusive)end_dt: End date (inclusive), defaults tostart_dtbatch_id: Optional batch ID filtermodel_version: Optional model version filter
Returns:
- DataFrame with all samples matching the filters
- Empty DataFrame if no data found
Delete¶
Delete old data to manage storage:
from datetime import datetime, timedelta
# Delete data older than 30 days
cutoff = datetime.now() - timedelta(days=30)
deleted_count = backend.delete(cutoff)
print(f"Deleted {deleted_count} partitions")
Parameters:
cutoff_dt: Delete data before this date
Returns:
- Number of date partitions deleted
Parquet Benefits¶
Storage Efficiency¶
- Columnar format: Only read columns you need
- Compression: Efficient compression algorithms
- Type optimization: Stores data types efficiently
Typical compression ratios: 5-10x compared to CSV.
Query Performance¶
# Fast: Only reads required dates
df = backend.read(
datetime(2025, 12, 26),
datetime(2025, 12, 27)
) # Only reads 2 days
# Efficient: Columnar access
shap_values = df[['shap_feature_1', 'shap_feature_2']] # Only reads 2 columns
Compatibility¶
- Readable by pandas, pyarrow, DuckDB, Spark
- Standard format for data science workflows
- Easy integration with data lakes
Storage Management¶
Monitor Storage Size¶
# Check storage usage
du -sh /path/to/shap_logs
# List partitions
ls -lh /path/to/shap_logs/
Retention Policy¶
Implement a retention policy to manage storage:
from datetime import datetime, timedelta
def cleanup_old_data(backend, retention_days=30):
"""Delete data older than retention_days."""
cutoff = datetime.now() - timedelta(days=retention_days)
deleted = backend.delete(cutoff)
print(f"Deleted {deleted} partitions older than {retention_days} days")
# Run periodically (e.g., daily cron job)
cleanup_old_data(backend, retention_days=90)
Archival Strategy¶
For long-term storage:
import shutil
from datetime import datetime, timedelta
# Archive data older than 1 year to cold storage
cutoff = datetime.now() - timedelta(days=365)
archive_dir = "/archive/shap_logs"
# Copy then delete
shutil.copytree(
"/path/to/shap_logs",
archive_dir,
ignore=lambda dir, files: [f for f in files if is_newer_than(f, cutoff)]
)
backend.delete(cutoff)
Custom Backends¶
The backend interface is pluggable. Future versions may support:
- S3/Cloud storage
- Database backends (PostgreSQL, DuckDB)
- Time-series databases
To implement a custom backend, extend BaseBackend:
from shapmonitor.backends._base import BaseBackend
class CustomBackend(BaseBackend):
def read(self, start_dt, end_dt):
# Implement read logic
pass
def write(self, batch):
# Implement write logic
pass
def delete(self, cutoff_dt):
# Implement delete logic
pass
Performance Tips¶
Batch Size¶
- Larger batches → fewer files → better read performance
- Smaller batches → more granular timestamps
- Recommended: 100-1000 samples per batch
Date Partitioning¶
- Daily partitions work well for most use cases
- Automatic in ParquetBackend
- Enables efficient date range queries
Compression¶
Parquet automatically compresses data. For custom compression:
# When writing directly to Parquet
df.to_parquet(
path,
compression='snappy', # Fast compression
# compression='gzip', # Better compression ratio
index=False
)
Concurrent Access¶
ParquetBackend is thread-safe for:
- Multiple writers (different batch IDs)
- Concurrent reads
- Read while writing
Troubleshooting¶
File Not Found¶
# Check if directory exists
if backend.file_dir.exists():
print("Backend directory exists")
else:
print("Backend directory not found")
Empty Results¶
# Verify data exists for date range
df = backend.read(start_date, end_date)
if df.empty:
print("No data in date range")
# Check what dates have data (Hive-style directories)
for date_dir in sorted(backend.file_dir.glob("date=*")):
if date_dir.is_dir():
print(f"Data available: {date_dir.name}")
Corrupt Files¶
# List all files in partition (Hive-style)
from pathlib import Path
partition = backend.file_dir / "date=2025-12-26"
for file in partition.rglob("*.parquet"):
try:
df = pd.read_parquet(file)
print(f"✓ {file.name}: {len(df)} rows")
except Exception as e:
print(f"✗ {file.name}: {e}")