A look at feature stores to enable machine learning
Working with data is messy, and getting it ready for any kind of model building is a painstaking, time-consuming process. Feature stores are meant to streamline this process for every machine learning team in an organization.
The concept of a feature store has been growing in popularity since Uber introduced it in 2017 as part of the data management layer of its machine learning platform, Michelangelo. Netflix uses a similar concept, a feature encoder, in its time-travel system, DeLorean. Other companies embracing feature stores include LinkedIn and Airbnb.
What is a feature store?
A feature store is a curated collection of pre-computed features, stored in a centralized location. Features are calculated by joining raw data sources and transforming them into formats that can be interpreted by machine learning models. They are then stored under descriptive names in a central location, making them identifiable and accessible to the teams using them.
In addition to storing features, a feature store continuously updates feature calculations as new data arrives and tracks the data lineage of those updates. It can also create new features and backfill them from historical data.
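To make the idea concrete, here is a minimal sketch of the two steps described above: transforming raw data into named features, and backfilling a newly defined feature by rerunning the same transformation over the full history. The data, column names, and feature names are all hypothetical, and this is plain pandas rather than any real feature store API.

```python
import pandas as pd

# Hypothetical raw event data: one row per ride, as a ride-hailing
# platform might log it. Everything here is illustrative.
rides = pd.DataFrame({
    "driver_id": [1, 1, 2, 2, 2],
    "event_time": pd.to_datetime([
        "2023-01-01", "2023-01-03", "2023-01-02", "2023-01-04", "2023-01-05",
    ]),
    "fare": [12.0, 8.0, 20.0, 15.0, 10.0],
})

def compute_driver_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Transform raw rides into features stored under descriptive names."""
    return (
        raw.groupby("driver_id")
        .agg(
            driver_total_rides=("fare", "count"),
            driver_avg_fare=("fare", "mean"),
        )
        .reset_index()
    )

# "Backfill": because the transformation is a function of the raw data,
# a newly defined feature can be computed for all historical records
# simply by rerunning it over the full history.
features = compute_driver_features(rides)
print(features)
```

In a real feature store, the same transformation would also be re-run incrementally as new ride events arrive, so the stored values stay current without teams recomputing them by hand.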
But what is it for?
Data scientists at Airbnb spend up to 60% of their time engineering features for machine learning, and Uber reports that most model-training time goes into transforming and selecting features. Creating these features is a fundamental part of the machine learning workflow, but also an inefficient one.
Feature stores solve several problems in the machine learning workflow:
1. If features are calculated ad-hoc as part of the pipeline, they are difficult to reuse across different pipelines. This wastes time and resources during the experimentation phase. Feature stores standardize how features are accessed and make them reusable across pipelines and teams.
2. It allows teams to share insights by including new features that they have found useful in the store.
3. Feature stores also make data consistent. Without one, several teams may unknowingly build the same features, wasting resources; worse, differing implementations may produce different results, obscuring the definition and meaning of a feature across the teams and the company.
4. Features are automatically updated when new data arrives, while their data lineage is tracked. Teams do not have to redo calculations to refresh their features, yet they can still replicate experiments as they were before a data update. This matters for explaining model decisions and tracking model performance over time.
5. Features calculated during the experimentation phase in the data scientist's preferred language may not be perfectly replicable in the production system. This creates an inconsistency between the features used for training and those computed for live predictions. A model's performance is then difficult to measure accurately until it is deployed, which can be a costly mistake.
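The last point, train/serve skew, is usually addressed by defining each feature exactly once and reusing that single definition in both the offline training pipeline and the online serving path. A minimal sketch, with an invented feature and invented field names:

```python
import pandas as pd

def fare_per_minute(fare: float, duration_min: float) -> float:
    """One feature, one definition, shared by training and serving."""
    return fare / duration_min if duration_min > 0 else 0.0

# Offline: apply the definition over historical data to build a training set.
history = pd.DataFrame({"fare": [12.0, 30.0], "duration_min": [10.0, 20.0]})
history["fare_per_minute"] = [
    fare_per_minute(f, d)
    for f, d in zip(history["fare"], history["duration_min"])
]

# Online: the exact same function computes the feature for a live
# prediction request, so training and serving values cannot drift apart.
live_value = fare_per_minute(fare=18.0, duration_min=12.0)
```

Because both paths call the same function, a model trained on `history` sees features computed identically to those it will receive in production.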
The feature store makes it possible for teams to search through a catalogue of features for what they need. Teams can gain insight from other teams by browsing the existing features or sharing their own when they add new ones. The time saved on feature engineering can then be dedicated to model building. Since this is a very experimental process, the time added to the model building phase allows more ideas to be tested and a better solution built.
Off-the-shelf or custom-built?
Netflix, Uber, and Airbnb have all created their own custom feature store implementations. As early adopters, they had no off-the-shelf solution available to them. Not every company needs a fully custom build, however, and off-the-shelf options now exist: open-source feature stores are available from Hopsworks and Feast, and Airbnb plans to open source its solution, Zipline, in the future.
Building a reliable, fast feature store can be a massive undertaking: it must be malleable enough to create new features and backfill them, produce results in a reasonable amount of time, be queryable from multiple platforms, and be accessible to both offline sandboxes and online deployed models.
Off-the-shelf options are faster to get up and running and will show a return on investment sooner, but they are still quite new, and there is little feedback yet from the organizations that have adopted them.
Want to read more?
Feature Store: the missing data layer in ML pipelines?: https://www.logicalclocks.com/feature-store/
Introducing Feast: https://towardsdatascience.com/introducing-feast-5eb4d539c13f
Meet Michelangelo: Uber’s Machine Learning Platform: https://eng.uber.com/michelangelo/
Distributed Time Travel for Feature Generation: https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907