Open Source Feature Stores
There are two main open-source feature stores used in production today: Hopsworks and Feast.
The Hopsworks Feature Store was the first open-source feature store, released in December 2018, and Feast followed shortly thereafter. Hopsworks is available under the AGPL-v3 license, while Feast is available under the Apache v2 license. Hopsworks is developed by Logical Clocks, while Feast is developed primarily by Tecton and GoJek and is hosted as a Linux Foundation (LF AI & Data) project.
Hopsworks
Hopsworks is a stand-alone platform that ships with its own installer. Out of the box, it provides support for:
- feature computation (Spark, PySpark, Python, Spark Streaming);
- offline feature storage using Hive and HopsFS;
- online feature storage using RonDB, which also serves as the database behind the Hive metastore and Hopsworks itself;
- streaming ingestion with Apache Kafka;
- a complete data science platform (optional) with Jupyter notebooks and Jobs for model training and feature engineering with Python, Spark, or Flink;
- model serving with TensorFlow Serving and Flask.
The open-source version of Hopsworks has nearly all the features of the enterprise version. It lacks only support for single sign-on (Active Directory, OAuth2) and integration with external Kubernetes clusters (to run notebook servers, run Jobs, or serve models).
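The split between offline and online feature storage listed above can be illustrated with a toy, in-memory sketch. This is not the Hopsworks API: the class `ToyFeatureStore` and its methods are hypothetical names chosen for illustration, and a real deployment would back the two stores with Hive/HopsFS and RonDB respectively. The sketch only shows the idea that the offline store keeps the full history (for training), while the online store keeps the latest values per key (for low-latency serving).

```python
from typing import Any, Dict, List, Tuple


class ToyFeatureStore:
    """Hypothetical in-memory stand-in for an offline + online feature store."""

    def __init__(self) -> None:
        # Offline store: every row ever ingested (key, event_time, features).
        self.offline: List[Tuple[str, int, Dict[str, Any]]] = []
        # Online store: only the most recent row per key.
        self.online: Dict[str, Tuple[int, Dict[str, Any]]] = {}

    def ingest(self, key: str, event_time: int, features: Dict[str, Any]) -> None:
        """Append to the offline history and update the online value if newer."""
        self.offline.append((key, event_time, features))
        current = self.online.get(key)
        if current is None or event_time >= current[0]:
            self.online[key] = (event_time, features)

    def get_online(self, key: str) -> Dict[str, Any]:
        """Return the latest feature values for a key (serving path)."""
        return self.online[key][1]

    def get_history(self, key: str) -> List[Dict[str, Any]]:
        """Return all feature values for a key, oldest first (training path)."""
        rows = [row for row in self.offline if row[0] == key]
        return [row[2] for row in sorted(rows, key=lambda row: row[1])]


store = ToyFeatureStore()
store.ingest("customer_1", event_time=1, features={"orders": 3})
store.ingest("customer_1", event_time=2, features={"orders": 5})
# A late-arriving, older row lands in the history but does not overwrite
# the newer online value.
store.ingest("customer_1", event_time=0, features={"orders": 1})

print(store.get_online("customer_1"))
print(store.get_history("customer_1"))
```

Keeping two stores is a design trade-off: the history grows without bound but is only scanned in batch, while the online view stays small enough for key-value lookups.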
Feast
Feast is a smaller system than Hopsworks and is often deployed on a Kubernetes cluster (with Helm charts or Terraform scripts), but it can also be deployed on AWS, Azure, or GCP. Feast was originally designed to work on GCP with BigQuery, but in late 2020 it switched to Spark as the engine for ingesting features from external sources. Feast does not come with a UI, security, or the ability to do feature engineering. It must be connected to existing services and databases to provide its functionality:
- you need to connect to a Spark cluster to be able to ingest features (like EMR on AWS);
- you need a Postgres database (like RDS on AWS) to store feature metadata;
- you need a Kafka cluster (like managed Kafka on AWS) to ingest streaming data;
- you need a Redis database (like ElastiCache on AWS) to serve features online with low latency.
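The four external dependencies above can be summarized as a deployment checklist. The following is a minimal sketch, not Feast's actual configuration format: the class `FeastDeployment`, the function `missing_services`, and the example hostnames are all hypothetical and exist only to show which services must be in place before a Feast deployment of this kind is usable.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class FeastDeployment:
    """Hypothetical checklist of the external services Feast relies on."""

    spark_cluster: Optional[str] = None  # batch feature ingestion (e.g. EMR)
    postgres_url: Optional[str] = None   # feature metadata (e.g. RDS)
    kafka_brokers: Optional[str] = None  # streaming ingestion
    redis_url: Optional[str] = None      # low-latency online serving


def missing_services(deployment: FeastDeployment) -> List[str]:
    """Return the names of services that have not been configured yet."""
    return [
        field.name
        for field in fields(deployment)
        if getattr(deployment, field.name) is None
    ]


# Example: only Spark and Redis have been provisioned so far.
# The hostnames below are placeholders, not real endpoints.
partial = FeastDeployment(
    spark_cluster="emr-cluster.example.internal",
    redis_url="redis://cache.example.internal:6379",
)
print(missing_services(partial))
```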