The Pattern
A team builds their first production ML model. It works. Then someone asks: “Should we use a feature store?”
The question implies the answer. Feature stores are standard MLOps infrastructure. Every mature ML organization has one. The vendors say so. The conference talks recommend them. Not having a feature store feels like technical debt.
So the team spends three months evaluating Feast, Tecton, and Databricks Feature Store. Another two months integrating the chosen solution. Another month debugging why features aren’t matching between training and serving.
Six months later, they’re serving predictions from a feature store that recomputes features on every request - exactly what they were doing before, but with more complexity and latency.
This pattern repeats constantly. Feature stores solve real problems. But most teams don’t have those problems yet.
What Feature Stores Actually Solve
Feature stores solve three specific problems:
Problem 1: Training-Serving Skew
When training uses different feature computation logic than serving. The model trains on sum(purchases_last_30_days) but serves with sum(purchases_last_month) - different results, model breaks.
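A hedged sketch of how the two definitions can drift - the function names and data shapes here are hypothetical, not code from any real pipeline:

# Hypothetical illustration of training-serving skew: both helpers are supposed
# to compute the "same" feature but disagree near month boundaries.
from datetime import timedelta

def purchases_last_30_days(purchases, now):
    # Training pipeline: rolling 30-day window.
    cutoff = now - timedelta(days=30)
    return sum(p["amount"] for p in purchases if p["date"] >= cutoff)

def purchases_last_month(purchases, now):
    # Serving path: "last month" read as the current calendar month.
    return sum(
        p["amount"]
        for p in purchases
        if p["date"].year == now.year and p["date"].month == now.month
    )

# Early in June, a purchase from May 20th falls inside the rolling window but
# outside the calendar month, so serving feeds the model a feature value it
# never saw during training.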
Problem 2: Feature Recomputation
When multiple models need the same features. Computing user_lifetime_value independently for each model wastes resources.
Problem 3: Point-in-Time Correctness
When training needs historical feature values. For a prediction made on 2024-06-15, what was user_tier on that date? Naive joins use current values, introducing data leakage.
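To make the leakage concrete, here is a minimal sketch using pandas.merge_asof; the users, tiers, and timestamps are made up for illustration:

# Point-in-time correctness in miniature: the naive join leaks future data,
# the as-of join uses only what was known at prediction time.
import pandas as pd

predictions = pd.DataFrame({
    "user_id": [1, 1],
    "timestamp": pd.to_datetime(["2024-03-01", "2024-06-15"]),
})

tier_history = pd.DataFrame({
    "user_id": [1, 1],
    "timestamp": pd.to_datetime(["2024-01-01", "2024-05-01"]),
    "user_tier": ["basic", "premium"],
})

# Leaky: attaches the user's *current* tier ("premium") to every training row,
# including the March prediction, when the user was still "basic".
leaky = predictions.merge(
    tier_history.sort_values("timestamp").groupby("user_id").tail(1),
    on="user_id",
    suffixes=("", "_feature"),
)

# Correct: for each prediction, take the most recent tier known at that time.
correct = pd.merge_asof(
    predictions.sort_values("timestamp"),
    tier_history.sort_values("timestamp"),
    on="timestamp",
    by="user_id",
    direction="backward",
)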
These are real problems. If you have them, feature stores help. But you might not have them yet.
When You Don’t Need a Feature Store
You Have One Model
If you have a single model, training-serving skew is easy to avoid without infrastructure:
# features.py - single source of truth
from datetime import date
from statistics import mean

def compute_features(user_data, transaction_data):
    """Used by both training and serving"""
    return {
        'total_purchases': len(transaction_data),
        'avg_purchase_value': mean([t.amount for t in transaction_data]),
        'days_since_last_purchase': (date.today() - max([t.date for t in transaction_data])).days,
        # ... more features
    }

# training.py
features = compute_features(user_data, transactions)
model.train(features, labels)

# serving.py
features = compute_features(user_data, transactions)
prediction = model.predict(features)

This works. It’s simple. It’s maintainable. Skew is impossible - same code path for both.
Why this works: With one model, feature logic fits in one file. No coordination needed. No shared infrastructure required.
When this breaks: When you have 10 models and each reimplements compute_features() slightly differently. Now you have skew risk and maintenance burden.
Your Features Are Request-Scoped
If features only use data in the request, there’s nothing to store:
# Request contains everything needed
@app.post("/predict")
def predict(request: PredictRequest):
    features = {
        'transaction_amount': request.amount,
        'merchant_category': request.merchant_category,
        'is_international': request.country != 'US',
        'hour_of_day': datetime.now().hour,
    }
    return model.predict(features)

Why this works: No historical data needed. No precomputation needed. A feature store would add latency without benefit.
When this breaks: When you need user_average_transaction_amount or merchant_fraud_rate - data not in the request. Now you need storage.
You Can Tolerate Batch Predictions
If predictions can be computed overnight and cached, feature stores are overkill:
# Nightly batch job
def compute_all_predictions():
    users = load_all_users()
    for user in users:
        features = compute_features(user)
        prediction = model.predict(features)
        cache.set(f"prediction:{user.id}", prediction)

# Serving just reads cache
@app.get("/prediction/{user_id}")
def get_prediction(user_id: str):
    return cache.get(f"prediction:{user_id}")

Why this works: Features computed once per day. Predictions cached. Serving is just cache lookup. No online feature computation needed.
When this breaks: When predictions need to be real-time based on latest data. Now you need online features.
Your Training Data Is Small
If your training dataset is modest (millions of rows, not billions), point-in-time correctness is just a SQL query:
# Training with point-in-time correctness
training_data = db.query("""
    SELECT
        u.user_id,
        e.timestamp,
        u.created_at,
        COUNT(t.id) as num_transactions,
        AVG(t.amount) as avg_transaction
    FROM events e
    JOIN users u ON e.user_id = u.user_id
    LEFT JOIN transactions t ON t.user_id = u.user_id
        AND t.timestamp < e.timestamp -- Point-in-time correctness
    WHERE e.label IS NOT NULL
    GROUP BY u.user_id, e.timestamp, u.created_at
""")

Why this works: The database handles the point-in-time join. No feature store materialization needed. The query is fast enough for typical training-set sizes.
When this breaks: When you have billions of training examples and complex feature joins. Now the SQL query takes hours. Feature store precomputation becomes necessary.
What to Use Instead
If you don’t need a feature store, use simpler alternatives:
Alternative 1: Shared Feature Functions
# features/user_features.py
def compute_user_features(user_id: str, as_of: datetime = None):
    """Compute user features for training or serving

    Args:
        user_id: User identifier
        as_of: Timestamp for point-in-time correctness (training).
            If None, uses current time (serving).
    """
    as_of = as_of or datetime.now()
    transactions = db.query(
        "SELECT * FROM transactions WHERE user_id = ? AND timestamp < ?",
        user_id, as_of
    )
    return {
        'num_transactions': len(transactions),
        'total_spent': sum(t.amount for t in transactions),
        'avg_transaction': mean([t.amount for t in transactions]),
        'days_since_last': (as_of - max([t.timestamp for t in transactions])).days
    }

# Training uses as_of for point-in-time correctness
train_features = [
    compute_user_features(ex.user_id, as_of=ex.timestamp)
    for ex in training_examples
]

# Serving uses current time
serve_features = compute_user_features(request.user_id)

Advantages:
- Training-serving skew impossible (same code)
- Point-in-time correctness handled
- No new infrastructure
- Easy to debug (just Python)
Disadvantages:
- Repeated computation (no caching across models)
- Slow for many models or large-scale training
Alternative 2: Cached Aggregations
# Precompute expensive features, cache results
class FeatureCache:
    def __init__(self, cache_ttl_seconds=300):
        self.cache = {}
        self.ttl = cache_ttl_seconds

    def get_user_features(self, user_id: str):
        cache_key = f"user_features:{user_id}"

        # Check cache
        if cache_key in self.cache:
            cached_value, timestamp = self.cache[cache_key]
            if time.time() - timestamp < self.ttl:
                return cached_value

        # Compute and cache
        features = self._compute_user_features(user_id)
        self.cache[cache_key] = (features, time.time())
        return features

    def _compute_user_features(self, user_id):
        # Expensive computation here
        return compute_features(user_id)

# Use in serving
feature_cache = FeatureCache(cache_ttl_seconds=300)

@app.post("/predict")
def predict(request: PredictRequest):
    features = feature_cache.get_user_features(request.user_id)
    return model.predict(features)

Advantages:
- Fast serving (cache hits avoid computation)
- No infrastructure beyond Redis/Memcached (a Redis-backed variant is sketched below)
- TTL controls freshness
- Works for multiple models
Disadvantages:
- Cache invalidation complexity
- No point-in-time correctness for training
- Need to handle cache misses
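If the cache has to be shared across processes or across models rather than living inside one Python process, the same pattern maps onto Redis in a few lines. A minimal sketch using redis-py - the host, key prefix, and TTL are assumptions, and compute_user_features is the shared function from Alternative 1:

# Same TTL-cache pattern, backed by Redis so every serving process and model
# reads from one shared cache. Connection details are placeholders.
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def get_user_features(user_id: str, ttl_seconds: int = 300) -> dict:
    cache_key = f"user_features:{user_id}"
    cached = r.get(cache_key)
    if cached is not None:
        return json.loads(cached)
    features = compute_user_features(user_id)  # shared function from Alternative 1
    # SETEX writes the value with an expiry, so stale entries age out on their own.
    r.setex(cache_key, ttl_seconds, json.dumps(features))
    return features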
Alternative 3: Materialized Views
# Database-native feature materialization
db.execute("""
    CREATE MATERIALIZED VIEW user_features AS
    SELECT
        user_id,
        COUNT(*) as num_transactions,
        SUM(amount) as total_spent,
        AVG(amount) as avg_transaction,
        MAX(timestamp) as last_transaction_date
    FROM transactions
    GROUP BY user_id
""")

# Refresh periodically (e.g., hourly)
db.execute("REFRESH MATERIALIZED VIEW user_features")

# Training queries the view
train_features = db.query("""
    SELECT u.*, f.*
    FROM training_examples u
    JOIN user_features f ON u.user_id = f.user_id
""")

# Serving queries the view
serve_features = db.query(
    "SELECT * FROM user_features WHERE user_id = ?",
    user_id
)

Advantages:
- Database-native (no new systems)
- Fast reads (precomputed)
- SQL-based (familiar tools)
- Works for moderate scale
Disadvantages:
- Refresh lag (data staleness; see the refresh sketch below)
- Less flexible than code
- Doesn’t scale to billions of features
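The refresh schedule is what bounds that staleness. A minimal sketch of the periodic refresh job, reusing the same hypothetical db helper as the examples above; REFRESH ... CONCURRENTLY is PostgreSQL-specific and requires a unique index on the view:

# Hourly refresh loop: feature staleness is bounded by REFRESH_INTERVAL_SECONDS.
import time

REFRESH_INTERVAL_SECONDS = 3600

def refresh_user_features_forever():
    while True:
        # CONCURRENTLY lets readers keep querying the view during the refresh.
        db.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY user_features")
        time.sleep(REFRESH_INTERVAL_SECONDS)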
When You Actually Need a Feature Store
You need a feature store when:
1. Multiple teams, many models
When 5 teams are building 20 models and all need user_lifetime_value. Reimplementing it 20 times creates skew risk and maintenance burden.
2. Real-time features at scale
When you need sub-100ms serving with features computed from terabytes of data. Materialized views and caches don’t scale to this.
3. Complex point-in-time correctness
When training requires accurate historical feature values across dozens of feature types with different update frequencies.
4. Feature reuse is proven valuable
When you measure that 80% of features are shared across models - not when you hope they might be shared someday. A rough way to measure this is sketched below.
5. Feature computation is expensive
When computing features costs more than storing them. For example, complex aggregations over streaming data.
At this point, feature store infrastructure pays for its complexity. Before this point, it’s premature optimization.
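On point 4, here is a rough, hypothetical way to check whether reuse is real rather than hoped for - the registry dict is an assumption standing in for wherever each model’s feature list actually lives:

# Measure how much feature overlap actually exists across models.
from itertools import combinations

model_features = {
    "churn_model": {"user_lifetime_value", "days_since_last_purchase", "num_transactions"},
    "fraud_model": {"user_lifetime_value", "num_transactions", "avg_transaction"},
    "ltv_model": {"user_lifetime_value", "avg_transaction", "days_since_last_purchase"},
}

def shared_feature_ratio(features_by_model):
    """Fraction of distinct features consumed by more than one model."""
    all_features = set().union(*features_by_model.values())
    shared = {
        f
        for a, b in combinations(features_by_model.values(), 2)
        for f in a & b
    }
    return len(shared) / len(all_features)

print(shared_feature_ratio(model_features))  # 1.0 in this toy example

# If this ratio stays low across your real models, the deduplication benefit of
# a feature store is mostly theoretical.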
The Migration Path
If you start simple and later need a feature store, migration is straightforward:
Phase 1: Shared functions (current state)
def compute_features(user_id):
    # Compute on demand
    return features

Phase 2: Add caching
def compute_features(user_id):
    cached = cache.get(f"features:{user_id}")
    if cached:
        return cached
    features = _compute(user_id)
    cache.set(f"features:{user_id}", features, ttl=300)
    return features

Phase 3: Separate computation from serving
# Background job precomputes features
def precompute_features():
    for user_id in active_users():
        features = compute_features(user_id)
        feature_store.write(user_id, features)

# Serving reads precomputed features
def get_features(user_id):
    return feature_store.read(user_id)

Phase 4: Add feature store
# Now using Feast/Tecton/etc
features = feature_store.get_online_features(
    entity_rows=[{"user_id": user_id}],
    features=["user_lifetime_value", "transaction_count"]
)

Each phase works independently. You only move to the next phase when the current phase’s limitations become painful.
The Unsexy Truth
Feature stores solve real problems. But those problems appear at scale, not at the start.
Most teams building their first few models don’t have:
- Dozens of models competing for feature computation resources
- Terabytes of feature data requiring specialized storage
- Complex point-in-time correctness requirements across teams
What they have:
- One or two models
- Features that fit in a database
- Team small enough to coordinate in Slack
For these teams, a feature store is complexity without benefit. Shared functions and basic caching solve the same problems with less infrastructure.
Build the simple thing first. Add complexity when you have evidence you need it. You’ll know when that time comes - your team will be spending more time working around the limitations of simple approaches than they would spend adopting a feature store.
Until then, skip it.
Related Reading
For more on infrastructure decisions in production ML:
- Production AI Systems: What 30 Years of UNIX Taught Me - Principles for avoiding premature complexity
- Model Serving Architecture Patterns - When to choose simple vs complex serving architectures