Engineering

Iceberg Catalogs 2025: A Deep Dive into Emerging Catalogs and Modern Metadata Management


Emerging Apache Iceberg Catalogs: Project Nessie, Apache Gravitino, Apache Polaris, Lakekeeper, and Unity Catalog


In June 2025, Snowflake Summit and the Databricks Data+AI Summit brought plenty of interesting developments and announcements, which made me realize that the catalog war is still on.

The competition for Apache Iceberg catalogs certainly remains fierce and relevant. Several new catalogs have emerged in recent times, including Apache Polaris™ (incubating), Apache Gravitino, Lakekeeper, and Project Nessie, among others.

Databricks’ recent launch of Unity Catalog GA support for Apache Iceberg has introduced an exciting new facet to its ecosystem.

At the same time, the DuckDB announcement of DuckLake is equally important and has caught my attention. DuckLake streamlines lakehouses by employing a standard SQL database for metadata management, eliminating the need for complicated file-based systems while storing data in open formats like Parquet. This approach enhances reliability, speed, and manageability.

As a data engineer, I find these developments fascinating. Most of the developer community still banks on “classic” catalogs such as Hadoop Catalog, Hive Metastore, and AWS Glue, but the landscape is expanding rapidly.

In this blog, I will explain the real meaning of a catalog, starting from the basics and then assessing where traditional options struggle. I’ll then get into what the newer catalogs bring to the table.

What is a Catalog in Apache Iceberg?

It’s the central metadata repository that every query engine consults to know what an Iceberg table looks like and where to find its data.

Simply put, an Iceberg catalog is a metadata store that tracks all your Iceberg tables.

It’s the bookkeeper for table information, including schema, partition layout, the latest snapshot, and the location of the root metadata file (e.g., v1.metadata.json), along with its manifest list and manifest files.

Apache Iceberg organizes data into three layers:

  1. Data layer: Raw data files and delete files. These are most often Parquet, though ORC and Avro are also supported.
  2. Metadata layer: The table’s metadata, stored in manifest files, manifest lists, and the root metadata.json that together describe every data file and snapshot.
  3. Catalog layer: The “phone book” that tells every engine where the latest root metadata.json lives. It is primarily an external service that tracks tables and their metadata.
Apache Iceberg’s Table Architecture: a three-tier view with the Catalog Layer on top, the Metadata Layer in the middle, and the Data Layer at the bottom.

Without a catalog, an Iceberg table would just be a bunch of files in storage with no easy way to know which files make up the current table. The catalog provides a central pointer to the table’s current metadata, which in turn points to the current snapshot of the table’s data. This mechanism allows Iceberg to provide ACID guarantees and atomic commits on a data lake. Essentially, the catalog serves as the source of truth for the table’s state.

  • In practice, whenever you create, update, or drop an Iceberg table, it’s the catalog that records those changes.
  • For example, adding a new batch of data will create a new metadata JSON file for the table, and the catalog updates the pointer to reference this new metadata (so readers will see the new data in one consistent snapshot).
  • In a similar vein, dropping a table will remove its entry from the catalog. By handling these operations consistently, the catalog prevents orphaned metadata and keeps tables in a clean, correct state.
  • The catalog also enables multi-engine access because it behaves as a neutral store of metadata; multiple processing engines (e6data, Spark, Flink, and Trino) can concurrently read and write to Iceberg tables through the same catalog without stepping on each other’s toes.
  • Some catalog implementations even support transaction locking or other mechanisms to coordinate multiple writers.
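To make that pointer concrete, here is a minimal PyIceberg sketch that asks a catalog for a table and prints the root metadata file it currently points to. The REST endpoint and the table name are placeholders I’ve assumed for illustration; swap in whatever catalog your environment actually exposes.

# Minimal sketch: asking a catalog where a table's current root metadata lives.
# Assumes an Iceberg REST catalog at the placeholder URI and an existing table
# called demo.sales - both are assumptions, not part of this post's setup.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "my_catalog",
    **{
        "type": "rest",
        "uri": "http://localhost:8181/catalog",   # placeholder REST endpoint
    },
)

table = catalog.load_table("demo.sales")

# The catalog resolved the table name to its current root metadata file;
# every engine that goes through the same catalog sees the same pointer.
print(table.metadata_location)    # e.g. s3://bucket/warehouse/demo/sales/metadata/00003-....metadata.json
print(table.current_snapshot())   # the snapshot that root metadata currently references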

Iceberg is flexible about where this catalog lives. Broadly, there are two categories of catalogs in use today:

  • File-based catalogs – The simplest approach: using a filesystem or object store to keep track of tables. The built-in Hadoop Catalog falls in this category. It doesn’t require any separate service – metadata is stored in files on disk.
  • Service-based catalogs – These use an external service or database to store table metadata. Examples include using a Hive Metastore, AWS Glue Data Catalog, a JDBC database (such as PostgreSQL or MySQL), or newer, purpose-built services like Project Nessie. They usually add richer features (transaction coordination, security hooks, branching, or cross-table commits) and are the go-to choice for production. Many of the emerging catalogs are trying to follow the Iceberg REST specification.
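To make the two categories tangible, here is a rough Spark sketch that registers one catalog of each kind side by side. The Iceberg runtime version, warehouse path, and REST URI are assumptions for a local experiment, not a recommendation.

# Sketch: one file-based and one service-based catalog in the same Spark session.
# The Iceberg runtime version, warehouse path, and REST URI are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.0")
    # File-based: the Hadoop catalog keeps table pointers as files under one warehouse root.
    .config("spark.sql.catalog.hadoop_cat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hadoop_cat.type", "hadoop")
    .config("spark.sql.catalog.hadoop_cat.warehouse", "file:///tmp/warehouse")
    # Service-based: an Iceberg REST catalog (Nessie, Polaris, Lakekeeper, ...) owns the pointers.
    .config("spark.sql.catalog.rest_cat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.rest_cat.type", "rest")
    .config("spark.sql.catalog.rest_cat.uri", "http://localhost:8181/catalog")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS hadoop_cat.db")
spark.sql("SHOW NAMESPACES IN hadoop_cat").show()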

Here are two of our existing blogs to understand more about catalogs.

  1. Why catalogs matter: The book-keeping of Apache Iceberg
  2. Relevance of low-level metadata catalogs in lakehouses

With this, you should now have a basic understanding of the catalog in Apache Iceberg. Let’s explore the traditional catalogs available in the market and discuss the reasons why new types of catalogs are being introduced into the Apache Iceberg ecosystem.

Why is a Traditional Catalog Not Enough?

If you’re deep into the Iceberg ecosystem, you’re probably using the Hadoop Catalog, AWS Glue, or Hive Metastore. But it’s no secret that these catalogs have their shortcomings. As someone who has used all of them, I’ve identified some of the reasons why working with traditional catalogs is bothersome.

When a data engineer walks into the blazing chaos of a traditional data catalog

Hadoop Catalog

Pain Point | What Does it Mean in Practice?
Data-loss risk | DROP TABLE in a Hadoop catalog deletes data and metadata by default. There’s no “external” flag, so an accidental drop can wipe the lake.
Scalability | Commands such as SHOW TABLES trigger directory scans. On S3, each “list” is an API call, so thousands of tables make the listing crawl.
Rigid warehouse path | One catalog ⇢ one root directory. Want two buckets? You spin up two catalogs, which gets clunky fast.
Filesystem quirks | Iceberg relies on atomic rename. HDFS and ADLS Gen2 have it; S3 does not, so concurrent writers can collide unless you bolt on a lock table (e.g., DynamoDB-based), which adds an external dependency.

Hive Metastore

Pain Point | What Does it Mean in Practice?
Limited transactions and governance | HMS predates “data-as-code” approaches. It can atomically swap one table pointer, but cannot coordinate multi-table commits or row-level governance without add-ons (Ranger, Lake Formation, etc.).
Requires extra service to run | You need the metastore service plus a backing DB. Many teams offload that to AWS Glue just to avoid ops fatigue.
Scale quirks bleed into Iceberg | HMS locks can become a bottleneck under heavy concurrent writes; performance also hinges on how beefy (or not) the backing DB is.

AWS Glue Data Catalog

Pain Point | What Does it Mean in Practice?
No multi-table transactions | Like Hive, Glue only supports atomic updates to one table at a time. It lacks a transaction coordinator or two-phase commit, so updates to multiple tables (e.g., two Iceberg tables) can result in partial commits if a failure occurs mid-process. This is problematic for ETL workflows needing consistency across multiple tables.
AWS-only | Great inside AWS; awkward if you’re hybrid or multi-cloud. High-frequency metadata ops can feel slower than a co-located, self-hosted catalog.
Metadata lag on new Iceberg features | Glue tracks core schema/partition info but still lacks branching, tagging, or cross-table commit APIs.

Which are the Exciting New Age Catalogs?

Hadoop Catalog, HMS, and Glue all satisfy Iceberg’s minimum contract: “keep a table-name → metadata pointer and swap it atomically.” But they stop there. They don’t speak Git-style branching, they can’t coordinate multi-table commits, and they offer no built-in governance. That gap is exactly what the newer catalogs, such as Project Nessie, Apache Polaris, Gravitino, Lakekeeper, and now Unity Catalog, are aiming to close.

Project Nessie

Project Nessie is an open-source transactional catalog for Iceberg that brings Git-style branches, tags, and commit history to data lakes. This lets you experiment on an isolated branch and then merge (or roll back) when you’re satisfied, exactly like the workflow developers already know from source code.

This introduces a bunch of features for data management:

  • You can isolate experimental changes on a branch (e.g., dev or staging branches for a dataset).
  • Confidently test these changes.
  • Only merge the changes to the main branch when ready, while retaining the ability to time-travel or revert as the commit history is tracked.
Project Nessie: marrying Apache Iceberg’s catalog with Git-style branching to let teams isolate data changes on separate branches and merge only when they’re ready.

In the initial days, Nessie had its own gRPC/REST API and clients to manage commits. However, it has evolved and now supports the standard Iceberg REST Catalog spec, which enables any Iceberg-aware engine to communicate with Nessie natively.

Internally, the server keeps a “version store” commit log in a pluggable backend (in-memory, JDBC/Postgres, RocksDB, Cassandra, MongoDB, DynamoDB, etc.). When e6data, Spark, Flink, Trino, Dremio, or another engine writes through Nessie, each change is recorded as a commit in that log. And, a single commit can touch multiple tables, giving you true cross-table atomicity. This capability is either difficult or impossible to implement in Hive Metastore or AWS Glue.

As a matter of fact, Nessie enables cross-table transactions by design: a single commit in Nessie can include changes to multiple table pointers, making it possible to coordinate changes to several tables in one operation (all landing as one commit ID). This is a major advantage for data lakes that need to update multiple tables together consistently.

So, How Does Nessie Work?

  • Branches and tags are just references:  Nessie is “Git for data.” Each branch (e.g., main, feature_x) is nothing more than a pointer to a specific commit, giving you an isolated view of every Iceberg table. When you insert data on feature_x, Nessie writes a new Iceberg metadata file and records a commit on that branch—the pointer on main stays unchanged until you merge.
  • Commits are metadata-only. An INSERT writes a new Iceberg metadata file, then Nessie moves the branch pointer to that file in one optimistic-concurrency step.
  • Back-ends are pluggable. For local trials, you can run the default in-memory or JDBC store; for production, you might choose Postgres (as in the Docker Compose file later in this post) or RocksDB/Cassandra for distributed scale.
  • Standard APIs. Because Nessie now speaks the Iceberg REST protocol, any engine or library that implements that spec (Spark 3.5, Flink 1.19, Trino 448, PyIceberg, even DuckDB) can point at http://<nessie-server-address>:19120/api/v2 and just work; a quick example follows this list.
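To see how lightweight those references are, the sketch below lists them straight off Nessie’s v2 REST API. It assumes the server from the Docker Compose setup later in this post is running on localhost:19120; the response shape may differ slightly between Nessie versions.

# Sketch: listing Nessie references (branches and tags) via the v2 REST API.
# Assumes a Nessie server on localhost:19120, as in the Docker Compose file below.
import requests

resp = requests.get("http://localhost:19120/api/v2/trees")
resp.raise_for_status()

for ref in resp.json().get("references", []):
    # Each branch or tag is just a named pointer to a commit hash.
    print(ref.get("type"), ref.get("name"), ref.get("hash"))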

What are Nessie’s Key Features?

Feature | How Does this Benefit You?
Git-like branches/tags | CREATE BRANCH (SQL or API) forks the current state; MERGE BRANCH brings changes back. Gives catalog-level version control and time travel.
Cross-table commits | A single commit ID can update many Iceberg tables at once: no half-written state, perfect for multi-table ETL.
Zero-copy isolation | Each branch is a consistent view; the main branch never sees partial changes until you merge.
Multi-engine support | Works with e6data, Spark, Flink, Trino, Dremio, etc., via Iceberg REST or native plugins.
Auth plugins (OAuth2/OIDC) | Drop Nessie behind your IdP / API gateway for RBAC and audit logging.
CLI & Python SDK | pip install pynessie gives a command-line client and Python automation.
Mature OSS community | Active since 2020, production-hardened in Dremio Arctic and elsewhere.

Enough theory. Let’s install it on our local machine and play with it to understand more.

Hands-on with Project Nessie

Now that you’ve read about Nessie and what it can do, let’s get hands-on and take it for a real-world spin.

docker-compose.yaml

version: "3.9"
volumes:
  warehouse:
  nessie-db:
networks:
  demo: {}
services:
  # 1️⃣  Postgres 14 – Nessie commit log
  nessie-db:
    image: postgres:14
    environment:
      POSTGRES_DB: nessie
      POSTGRES_USER: nessie
      POSTGRES_PASSWORD: nessie
    volumes: [ nessie-db:/var/lib/postgresql/data ]
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "nessie"]
      interval: 10s
      retries: 5
    networks: [ demo ]
  # 2️⃣  Nessie server
  nessie:
    image: ghcr.io/projectnessie/nessie:0.104.1
    # platform: linux/amd64   # ← uncomment on Apple-silicon
    depends_on:
      nessie-db:
        condition: service_healthy
    environment:
      QUARKUS_HTTP_PORT: 19120
      NESSIE_VERSION_STORE_TYPE: JDBC2
      QUARKUS_DATASOURCE_JDBC_URL: jdbc:postgresql://nessie-db:5432/nessie
      QUARKUS_DATASOURCE_USERNAME: nessie
      QUARKUS_DATASOURCE_PASSWORD: nessie
    ports: [ "19120:19120" ]
    networks: [ demo ]
  # 3️⃣  Spark 3.5 + Iceberg 1.9 + Jupyter
  spark:
    image: tabulario/spark-iceberg
    depends_on: [ nessie ]
    ports:
      - "8888:8888"   # Jupyter
      - "4040:4040"   # Spark UI
    volumes:
      - warehouse:/opt/warehouse
      - ./conf/spark-defaults.conf:/opt/spark/conf/spark-defaults.conf:ro
      - ./notebooks:/home/iceberg/notebooks:rw           # ← notebooks persist
    environment:
      SPARK_DRIVER_MEMORY: 1G
    networks: [ demo ]
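
One note before running it: the Compose file mounts ./conf/spark-defaults.conf into the Spark container, and the notebook below assumes that file wires Spark to Nessie. I haven’t reproduced the conf file here, but expressed through the SparkSession builder, the equivalent wiring looks roughly like this (package versions, warehouse path, and the default-catalog choice are my assumptions; check the Nessie and Iceberg docs for your versions):

# Sketch: the Spark-to-Nessie wiring the mounted spark-defaults.conf is expected to provide.
# Versions, paths, and the catalog name are assumptions - adapt them to your setup.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.0,"
            "org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.1")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v2")
    .config("spark.sql.catalog.nessie.ref", "main")
    .config("spark.sql.catalog.nessie.warehouse", "file:///opt/warehouse")
    .config("spark.sql.defaultCatalog", "nessie")   # lets the notebook write demo.sales instead of nessie.demo.sales
    .getOrCreate()
)

With that wiring in place, whether baked into spark-defaults.conf or set on the builder, the notebook below walks through branching end to end.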

Run this code for a hands-on application of what we’ve done so far. Don’t forget to refer to my comments in the code.

# ── Boot a Spark-SQL session that already knows the Iceberg + Nessie extensions
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# ── 1. Create a demo namespace + table on the *main* branch
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo")

spark.sql("""
    CREATE TABLE demo.sales (
        id  INT,
        amt DOUBLE
    ) USING iceberg
""")                                    # ⇒ metadata-00000.json

# insert one row on main → commit A
spark.sql("INSERT INTO demo.sales VALUES (1, 9.99)")
spark.sql("SELECT * FROM demo.sales").show()   # 1 row

# ── 2. Fork the catalog state
spark.sql("CREATE BRANCH dev IN nessie")       # pointer dev → commit A
spark.sql("USE REFERENCE dev IN nessie")       # work on the branch

# add a second row on dev → commit B
spark.sql("INSERT INTO demo.sales VALUES (2, 19.95)")
spark.sql("SELECT * FROM demo.sales").show()   # 2 rows (dev only)

# ── 3. Switch back to production and verify isolation
spark.sql("USE REFERENCE main IN nessie")
spark.sql("SELECT * FROM demo.sales").show()   # still 1 row

# ── 4. Merge the dev work into main
spark.sql("MERGE BRANCH dev INTO main IN nessie")  # fast-forward main → B
spark.sql("REFRESH TABLE demo.sales")              # force Spark to reload metadata
spark.sql("SELECT * FROM demo.sales").show()       # main now shows 2 rows

Apache Gravitino

Apache Gravitino (incubating) takes a much broader view of the catalog problem. It is described as a “high-performance, geo-distributed, and federated metadata lake.” In simpler terms, Gravitino is a unified metadata and governance platform that acts as an Iceberg table catalog and federates metadata from many different sources and regions.

It was developed (and open-sourced) by the company Datastrato, and it aims to let you manage metadata and govern data across all your data sources, including cloud storage systems (data lakes), relational databases, streaming systems, and more, under one umbrella. This means Gravitino isn’t limited to just Iceberg or just one kind of data; it’s built to be a single source of truth for metadata in a large, heterogeneous data environment.

Apache Gravitino layers a pluggable metadata lake on top of an Iceberg catalog, then synchronizes and mirrors that metadata in real time across every source, region, and cloud.

How Does Apache Gravitino Work?

  • Provides a technical data catalog and metadata lake that plug into multiple underlying metadata stores through extensible connectors.
  • Can pull in or directly manage metadata from sources such as Hive Metastore, relational database catalogs, and file systems, exposing everything through a unified API.
  • Manages these systems directly, so updates in Gravitino or in an underlying source are reflected immediately in the other, eliminating lag or mismatches common in periodic-sync catalogs.
  • Features a geo-distributed architecture: instances deployed in different regions or clouds coordinate to share metadata, giving users a global, continuously synchronized catalog across data centers.
  • Builds on Apache Iceberg’s table format and REST catalog concepts, with plans to extend support to other formats like Hudi and Delta Lake.
  • Strives to act as a universal metadata hub for varied table formats and AI/ML assets.
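Because Gravitino fronts an Iceberg REST catalog service, any REST-capable client can talk to it like any other Iceberg catalog. Here is a hedged PyIceberg sketch; the endpoint URI is a placeholder for wherever your Gravitino deployment exposes its Iceberg REST service.

# Sketch: browsing tables through Gravitino's Iceberg REST catalog service.
# The URI is a placeholder - check your Gravitino deployment for the real host, port, and path.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "gravitino",
    **{
        "type": "rest",
        "uri": "http://localhost:9001/iceberg/",   # assumed local endpoint
    },
)

print(catalog.list_namespaces())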

What are Gravitino's Key Features?

Feature | How Does this Benefit You?
Unified Metadata Management | Provides a unified relational-table model (Hive, MySQL, PostgreSQL) and file-based model for unstructured data (HDFS, S3), abstracting Hive Metastore, object stores, streaming catalogs under one API.
End-to-End Data Governance | Centralized access control, auditing, and data discovery; policies enforced across Spark, Trino, etc., ensuring uniform compliance for warehouses, lakes, and AI datasets.
Federated and Geo-Distributed Architecture | Connector framework spans multiple back-ends (e.g., Hive Metastore, Snowflake) and regions; instances sync to maintain a global namespace while serving low-latency local queries.
High-Performance Metadata Store | Built-in key-value cache and optimized commit logic deliver low-latency, high-concurrency metadata operations, overcoming Hive Metastore’s slowness.
Pluggable & Extensible Modules | Modular extensions for authentication (OAuth2, LDAP), authorization, metrics, event listeners; supports custom monitoring and lineage triggers.
Multi-Engine & Multi-Format Support | Iceberg REST catalog already integrates with Trino; Spark, Flink, Hudi, Delta Lake, and AI/ML asset cataloging on the roadmap for a single multimodal catalog.

Apache Polaris

Apache Polaris (incubating) is an open-source, fully-featured catalog for Apache Iceberg tables. It was originally developed by Snowflake and contributed to Apache, with the goal of being a vendor-neutral Iceberg catalog. Polaris implements Iceberg’s standard REST API for interoperability across many engines (e6data, Spark, Flink, Trino, Dremio, StarRocks, Apache Doris, etc.) and allows them all to create, update, and manage Iceberg tables through a single service.

Essentially, Polaris decouples table metadata from any one compute platform. You can run Polaris as a centralized Iceberg catalog on your own infrastructure (or hosted in Snowflake’s cloud) and have different analytics engines use it concurrently without each needing its own metastore. The incredible bit is that this avoids siloed metadata and duplication of data for different tools.

Apache Polaris is a self-hosted, centralized Iceberg catalog that exposes Iceberg’s standard REST API, allowing engines such as e6data, Snowflake, Databricks, and Trino to read and write the same tables simultaneously through a shared metastore.

How Does Polaris Work?

  • Acts as a RESTful server that tracks Iceberg table metadata and schema definitions.
  • Client engines (e.g., Spark, Trino) interact with Polaris through REST API calls to read or write Iceberg tables.
  • Applies metadata changes atomically (such as committing new snapshots) ensuring ACID guarantees for table operations.
  • Designed for vendor neutrality: data can reside on any cloud object storage (S3, GCS, Azure Blob, etc.), with an abstraction layer that hides metadata location and format from engines.
  • Supports namespaces and catalogs to organize tables logically; a single deployment can host multiple catalogs.
  • Operates in two table modes:
    • Internal tables: fully managed by Polaris with full read/write access.
    • External tables: managed by another catalog (e.g., Hive, Glue); Polaris treats them as read-only mirrors.
  • Federation capability lets you gradually migrate from other metastores (e.g., start with Glue or Snowflake in read-only mode, then shift management to Polaris).
  • Enables hybrid deployments and avoids forcing an all-or-nothing migration path.
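To make “any engine via REST” concrete, here is a rough Spark configuration for a Polaris-managed catalog. The port, endpoint path, credential format, and scope follow Polaris’ OAuth2 client-credentials flow as I understand it; treat every value as an assumption and confirm against the Polaris docs for your version.

# Sketch: pointing Spark at a Polaris catalog over the Iceberg REST API.
# Host/port, catalog name, and the client_id:client_secret credential are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.0")
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris.uri", "http://localhost:8181/api/catalog")   # assumed endpoint
    .config("spark.sql.catalog.polaris.warehouse", "my_polaris_catalog")             # catalog name inside Polaris
    .config("spark.sql.catalog.polaris.credential", "<client_id>:<client_secret>")
    .config("spark.sql.catalog.polaris.scope", "PRINCIPAL_ROLE:ALL")
    .getOrCreate()
)

spark.sql("SHOW NAMESPACES IN polaris").show()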

What are Apache Polaris’ Key Features?

Feature | How Does this Benefit You?
Iceberg REST API & Multi-Engine Support | Implements the Iceberg REST catalog spec; any Iceberg-aware engine connects via the REST endpoint. All reads/writes flow through Polaris, giving ACID transactions and consistent views. Multiple engines can simultaneously query and update the same Iceberg tables while always seeing a consistent state.
Built-in Governance & RBAC | Snowflake-inspired governance with role-based access control at catalog, namespace, and table levels. Central policy enforcement means no per-tool duplication. Credential vending issues temporary cloud-storage credentials so engines can access only authorized data. This means no long-lived keys are required.
Vendor-Neutral & Flexible Deployment | Polaris is a community-driven project (contributors include Stripe, IBM, Uber, and Starburst). Runs on any infrastructure: on-prem, any cloud, Kubernetes, and any cloud storage. No dependency on Snowflake’s platform, preventing vendor lock-in and keeping metadata portable.
Catalog Federation (Hybrid Mode) | Registers external catalogs (Hive Metastore, AWS Glue, etc.) as read-only sources while Polaris-managed catalogs remain fully transactional. Enables gradual migration to a unified lakehouse without disrupting existing systems.
Robust Cloud Storage Support | Configurable storage backends for S3, Azure Blob, Google Cloud Storage, and more. Catalog tables across clouds; Polaris handles differing APIs and security. Credential vending plus unified governance secures multi-cloud access for all engines.

Lakekeeper

Lakekeeper is another open-source Iceberg catalog but with a different philosophy. It emphasizes simplicity, performance, and security. It is described as “a secure, fast, and user-friendly Apache Iceberg REST Catalog built with Rust” and aims to be a lean and efficient catalog service that is easy to deploy and integrate.

Lakekeeper was started by independent contributors (with a growing community around it) and is released under the Apache License. While Apache Polaris and Gravitino are large-scale projects with backing from big companies, Lakekeeper represents a more lightweight approach: it’s written in Rust (for performance and safety) and distributes as a single binary you can run anywhere (no heavy dependencies like a JVM or big database required).

Despite its lighter footprint, Lakekeeper fully implements the standard Iceberg REST API, meaning it can be used by any engine that speaks that API, similar to Polaris and Gravitino.

Iceberg Catalog + Rust and the Iceberg REST API flow into Lakekeeper, which ships as a single Rust binary and can be used by any Iceberg-compliant engine.

How Does Lakekeeper Work?

  • Architecture mirrors other Iceberg REST catalogs, featuring a server component that clients register with.
  • Spark or Flink jobs invoke Lakekeeper REST endpoints to create or update Iceberg tables; Lakekeeper manages commit logic and metadata storage.
  • Internally, metadata is stored in a relational (normalized) database—either an embedded store or an external SQL database.
  • This design lets Lakekeeper query and update metadata efficiently and add capabilities (e.g., metadata analytics) without scanning object-storage metadata files.
  • Compared with a pure “on-file” catalog like Hadoop Catalog, Lakekeeper can more easily provide global statistics, fast lookups, and multi-table transactions.
  • Server-side commits run inside a single database transaction, enabling true atomic commits that span multiple tables—something traditional metastore catalogs (e.g., Hive) lack.
  • Distributed as one compiled binary, so it deploys smoothly on a laptop for local tests or in Kubernetes for production.
  • Includes a web UI and a Helm chart for Kubernetes, simplifying initial setup.
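Client-side, Lakekeeper looks like any other Iceberg REST catalog. Below is a minimal PyIceberg sketch against a local instance; the URI path and warehouse name are assumptions for a default-ish local setup, so adjust them to match yours.

# Sketch: using Lakekeeper as a plain Iceberg REST catalog from PyIceberg.
# Endpoint and warehouse name are placeholders for a local deployment.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lakekeeper",
    **{
        "type": "rest",
        "uri": "http://localhost:8181/catalog",   # assumed local endpoint
        "warehouse": "demo",                      # warehouse/project configured in Lakekeeper
    },
)

print(catalog.list_namespaces())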

What are Lakekeeper's Key Features?

Feature | How Does this Benefit You?
High-performance, Single-Binary Server | Fast startup, low memory footprint, effortless containerization, horizontal scalability. Rust codebase (no JVM), stateless compute nodes, relational metadata model, “no unsafe code,” Helm chart & built-in UI.
Complete Iceberg REST Compatibility | Drop-in catalog for Spark, Flink, Trino, e6data, etc.; multi-table atomic commits & time-travel. Full Apache Iceberg REST API, cross-table transactions, eliminates Hive Metastore / AWS Glue limitations.
Fine-grained Access Control and Governance | Centralized, consistent permissions across all query engines. OAuth2/OIDC auth, Open Policy Agent bridge, OpenFGA backend, check API, audit log & change-event hooks.
Secure Data Access and Storage Management | Least-privilege object-store access without sharing long-lived keys. Vended credentials & signed URLs for S3, short-lived scope-limited tokens.
Automation and Maintenance | Hands-off optimization and lifecycle management for Iceberg tables. Planned background compaction, snapshot expiration, ecosystem-first Rust SDK/CLI, rapid feature adoption.

Databricks Unity Catalog

Databricks Unity Catalog is a bit different from the other catalogs discussed, as it originated as a proprietary offering from Databricks rather than an Apache open-source project. Unity Catalog is essentially Databricks’ unified governance solution for data and AI assets across the lakehouse.

Unity Catalog was introduced to provide Databricks customers with a single interface to manage all their tables, files, and machine learning assets (models, etc.) with fine-grained governance. The big news, however, is that in mid-2024, Databricks open-sourced Unity Catalog under the Apache 2.0 license, aiming to make it the industry’s first truly open universal catalog for data and AI.

The project is now hosted under the Linux Foundation (LF AI & Data) rather than Apache, but it’s developed in the open with an API-first approach. Unity Catalog’s importance lies in bridging the gap between traditional data governance (for tables/files) and AI governance (for ML models and more), all while supporting multiple data formats and execution engines.

How Does Unity Catalog Work?

Unity Catalog supplies a central metadata and permission store unified along three dimensions: multi-format, multi-engine, and multimodal.

  • Multi-format
    • Catalogs data in Delta Lake (Databricks’ native table format) as well as Apache Iceberg, Parquet, CSV, and others.
    • Introduces UniForm, which presents Delta tables in an Iceberg-compatible way so Iceberg clients can read Delta files, achieving interoperability between formerly competing standards.
  • Multi-engine
    • Exposes open interfaces such as a Hive Metastore-compatible API and the Iceberg REST API, enabling many processing engines to use Unity Catalog as their catalog.
    • Allows tools like Trino clusters or PyIceberg notebooks to query data governed by Unity Catalog, not just Databricks Spark clusters.
    • Targets becoming a universal data-access layer instead of being tied only to Databricks runtimes.
  • Multimodal
    • Governs traditional tabular data, unstructured files, machine-learning models, ML experiment tracking, and even AI “functions” or tools as first-class entities.
    • Provides a single namespace where all these asset types can be registered side by side and have access controlled consistently.
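For the multi-engine claim, reading a Unity Catalog-governed Iceberg table from outside Databricks looks roughly like this with PyIceberg. The endpoint path, token handling, and names are assumptions based on Databricks’ Iceberg REST support; verify the exact URI and auth options in the Databricks documentation before relying on them.

# Sketch: querying Unity Catalog's Iceberg REST endpoint from an external client.
# Workspace URL, endpoint path, token, and object names are placeholders/assumptions.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "unity",
    **{
        "type": "rest",
        "uri": "https://<workspace-host>/api/2.1/unity-catalog/iceberg",  # assumed path - confirm in Databricks docs
        "token": "<personal-access-token>",
        "warehouse": "<unity_catalog_name>",
    },
)

table = catalog.load_table("my_schema.my_table")
print(table.schema())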

What are the Key Features of Unity Catalog?

Feature | How Does this Benefit You?
Unified Governance Across Data & AI | One catalog to set fine-grained ACLs, audit every access, and trace lineage for tables, files, features, and models. So security and impact analysis stay consistent across every engine (SQL, REST, etc.).
Open Interfaces (Hive Metastore & Iceberg REST) | Speaks the standard HMS and Iceberg APIs, letting Spark, Flink, Trino, Presto, and other tools plug in with zero custom glue, avoids vendor lock-in while mapping Delta ↔ Iceberg metadata behind the scenes.
Multi-Format Data Support | Registers Delta, Iceberg, Parquet/CSV/JSON side-by-side, so you can migrate lakes or run mixed workloads without forced conversions; Unity stores unified metadata and exposes each format natively.
Credential Management & Secure Access | Centralizes storage auth with credential vending/passthrough and IAM federation; engines get least-privilege temp creds or pre-signed URLs, keeping data-plane access bounded by catalog policy.
Multimodal Asset Management | Treats MLflow models, feature tables, and LLM tools as first-class assets next to data, capturing full lineage (raw → table → feature → model) and enforcing role-based publish/consume controls.
Ecosystem & Collaboration | Backed by AWS, Azure, GCP, Nvidia, Salesforce, etc.; integrates with Delta Sharing for secure cross-org reads and can run standalone, positioning Unity as the open standard governance layer.

In Conclusion

I reiterate that the catalog war is still alive. Through the course of this blog, we traced how Iceberg’s classic catalogs (Hadoop, Hive, and Glue) provide the essential feature, an atomic pointer swap, while leaving gaps around branching, cross-table consistency, and governance.

Going beyond that and adding the features data engineers actually need is where modern catalogs such as Nessie, Polaris, Gravitino, Lakekeeper, and Unity Catalog make the cut. They speak the Iceberg REST dialect while layering on Git-style version control, federated views, fine-grained policies, and multi-engine freedom.

Together, these emerging catalogs prove that a catalog is no longer just a phone book but a contract that lets every engine collaborate confidently on the same lake, with safeguards baked in and silos designed out. Whether you prize immutable histories, geo-scalable metadata, or a lean Rust binary you can run anywhere, the toolbox now exists, and the next move is yours. Choose boldly, experiment early, and let your lakehouse flourish!


June 27, 2025 / Ankur Ranjan
Engineering
Emerging Apache Iceberg Catalogs: Project Nessie, Apache Gravitino, Apache Polaris, Lakekeeper, and Unity Catalog

In June 2025, the Snowflake Summit 2025 and the Databricks Data+AI Summit 2025 shared many interesting developments and thoughts, which made me realize that the catalog war is still on.

The competition for Apache Iceberg catalogs certainly remains fierce and relevant. Several new catalogs have emerged in recent times, including Apache Polaris™ (incubating), Apache Gravitino, Lakekeeper, and Project Nessie, among others.

Databricks’ recent launch of Unity Catalog GA support for Apache Iceberg has introduced an exciting new facet to its ecosystem.

At the same time, the DuckDB announcement of DuckLake is equally important and has caught my attention. DuckLake streamlines lakehouses by employing a standard SQL database for metadata management, eliminating the need for complicated file-based systems while storing data in open formats like Parquet. This approach enhances reliability, speed, and manageability.

As a data engineer, these developments made me very interested and fascinated at the same time. But most of the developer community still banks on “classic” catalogs, such as Hadoop Catalog, Hive Metastore, and AWS Glue, but the landscape is expanding rapidly.

In this blog, I will explain the real meaning of a catalog, starting from the basics and then assessing where traditional options struggle. I’ll then get into what the newer catalogs bring to the table.

What is a Catalog in Apache Iceberg?

It’s the central metadata repository that every query engine consults to know what an Iceberg table looks like and where to find its data.

Simply put, an Iceberg catalog is a metadata store that tracks all your Iceberg tables.

It’s the bookkeeper for table information, including schema, partition layout, the latest snapshot, and the location of the root metadata file (e.g., v1.metadata.json), along with its manifest list and manifest files.

Apache Iceberg organizes data into three layers:

  1. Data layer: Raw data files & delete files. Mostly, data & delete files are Parquet file formats. Sometimes in ORC or AVRO.
  2. Metadata layer: Information about the Table metadata. It stores this information in the manifest files, manifest lists, and the root metadata.json that describes every file and snapshot.
  3. Catalog layer: The “phone book” that tells every engine where the latest root metadata.json lives. It is primarily an external service that tracks tables and their metadata.
Flowchart illustrating Apache Iceberg’s table architecture on a grey background, divided horizontally into three labeled tiers—Catalog Layer, Metadata Layer, and Data Layer.
Apache Iceberg’s Table Architecture

Without a catalog, an Iceberg table would just be a bunch of files in storage with no easy way to know which files make up the current table. The catalog provides a central pointer to the table’s current metadata, which in turn points to the current snapshot of the table’s data. This mechanism allows Iceberg to provide ACID guarantees and atomic commits on a data lake. Essentially, the catalog serves as the source of truth for the table’s state.

  • In practice, whenever you create, update, or drop an Iceberg table, it’s the catalog that records those changes.
  • For example, adding a new batch of data will create a new metadata JSON file for the table, and the catalog updates the pointer to reference this new metadata (so readers will see the new data in one consistent snapshot).
  • In a similar vein, dropping a table will remove its entry from the catalog. By handling these operations consistently, the catalog prevents orphaned metadata and keeps tables in a clean, correct state.
  • The catalog also enables multi-engine access because it behaves as a neutral store of metadata; multiple processing engines (e6data, Spark, Flink, and Trino) can concurrently read and write to Iceberg tables through the same catalog without stepping on each other’s toes.
  • Some catalog implementations even support transaction locking or other mechanisms to coordinate multiple writers.

Iceberg is flexible about where this catalog lives. Broadly, there are two categories of catalogs in use today :

  • File-based catalogs – The simplest approach: using a filesystem or object store to keep track of tables. The built-in Hadoop Catalog falls in this category. It doesn’t require any separate service – metadata is stored in files on disk.
  • Service-based catalogs – These use an external service or database to store table metadata. Examples include using a Hive Metastore, AWS Glue Data Catalog, a JDBC database (such as PostgreSQL or MySQL), or newer, purpose-built services like Project Nessie. They usually add richer features (transaction coordination, security hooks, branching, or cross-table commits) and are the go-to choice for production. Many of the emerging catalogs are trying to follow the Iceberg REST specification.

Here are two of our existing blogs to understand more about catalogs.

  1. Why catalogs matter: The book-keeping of Apache Iceberg
  2. Relevance of low-level metadata catalogs in lakehouses

With this, you should now have a basic understanding of the catalog in Apache Iceberg. Let’s explore the traditional catalogs available in the market and discuss the reasons why new types of catalogs are being introduced into the Apache Iceberg ecosystem.

Why is a Traditional Catalog Not Enough?

If you’re deep in the Iceberg, you’re probably using Hadoop, AWS Glue, or Hive Metastore. But it’s no secret that these catalogs have their shortcomings. As someone who has used all these catalogs, I have identified some of the reasons why using traditional catalogs is bothersome.

When a data engineer walks into the blazing chaos of a traditional data catalog
When a data engineer walks into the blazing chaos of a traditional data catalog

Hadoop Catalog

Pain Point What Does it Mean in Practice?
Data-loss risk DROP TABLE in a Hadoop catalog deletes data and metadata by default. This means there’s no “external” flag, so an accidental drop can wipe the lake.
Scalability Commands such as SHOW TABLES trigger directory scans. On S3, each “list” is an API call, so thousands of tables make the listing crawl.
Rigid warehouse path One catalog ⇢ one root directory. Want two buckets? You spin up two catalogs, which get clunky fast.
Filesystem quirks Iceberg relies on atomic rename. HDFS and ADLSv2 have it; S3 does not, so concurrent writers can collide unless you bolt on a DynamoDB-based lock table. Workarounds (like using DynamoDB) add external dependencies.

Hive Metastore

Pain Point What Does it Mean in Practice?
Limited transactions and governance HMS predates “data-as-code” approaches. It can atomically swap one table pointer, but cannot coordinate multi-table commits or row-level governance without add-ons (Ranger, Lake Formation, etc.).
Requires extra service to run You need the metastore service plus a backing DB. Many teams offload that to AWS Glue just to avoid ops fatigue.
Scale quirks bleed into Iceberg HMS locks can become a bottleneck under heavy concurrent writes; performance also hinges on how beefy (or not) the backing DB is.

AWS Glue Data Catalog

Pain Point What Does it Mean in Practice?
No multi-table transactions Like Hive, Glue supports only supports atomic updates to one table at a time. It lacks a transaction coordinator or two-phase commit, so updates to multiple tables (e.g., two Iceberg tables) can result in partial commits if a failure occurs mid-process. This is problematic for ETL workflows needing consistency across multiple tables.
AWS-only Great inside AWS; awkward if you’re hybrid or multi-cloud. High-frequency metadata ops can feel slower than a co-located, self-hosted catalog.
Metadata lag on new Iceberg features Glue tracks core schema/partition info but still lacks branching, tagging, or cross-table commit APIs.

Which are the Exciting New Age Catalogs?

Hadoop Catalog, HMS, and Glue all satisfy Iceberg’s minimum contract: “keep a table-name → metadata pointer and swap it atomically.” But they stop there. They don’t speak Git-style branching, nor do they have coordinated multi-table commits, nor have they built-in governance. That gap is exactly what the newer catalogs, such as Project Nessie, Apache Polaris, Gravitino, Lakekeeper, and now Unity Catalog, are aiming to close.

Project Nessie

Project Nessie is an open-source transactional catalog for Iceberg that brings Git-style branches, tags, and commit history to data lakes. This lets you experiment on an isolated branch and then merge (or roll back) when you’re satisfied, exactly as the workflow developers are familiar with from the source code.

This introduces a bunch of features for data management:

  • You can isolate experimental changes on a branch (e.g., dev or staging branches for a dataset).
  • Confidently test these changes.
  • Only merge the changes to the main branch when ready, while retaining the ability to time-travel or revert as the commit history is tracked.
Project Nessie: marrying Apache Iceberg’s catalog with Git-style branching to let teams isolate data changes on separate branches and merge only when they’re ready.

In the initial days, Nessie had its own gRPC/REST API and clients to manage commits. However, it has evolved and now supports the standard Iceberg REST Catalog spec, which enables any Iceberg-aware engine to communicate with Nessie natively.

Internally, the server keeps a “version store” commit log in a pluggable backend (in-memory, JDBC/Postgres, RocksDB, Cassandra, MongoDB, DynamoDB, etc.). When e6data, Spark, Flink, Trino, Dremio, or another engine writes through Nessie, each change is recorded as a commit in that log. And, a single commit can touch multiple tables, giving you true cross-table atomicity. This capability is either difficult or impossible to implement in Hive Metastore or AWS Glue.

As a matter of fact, Nessie enables cross-table transactions by design because a single commit in Nessie can include changes to multiple table pointers, making it possible to coordinate changes to several tables in one operation (all landing as one commit ID). This is a major advantage for data lakes that need consistent upgrades of multiple tables together.

So, How Does Nessie Work?

  • Branches and tags are just references:  Nessie is “Git for data.” Each branch (e.g., main, feature_x) is nothing more than a pointer to a specific commit, giving you an isolated view of every Iceberg table. When you insert data on feature_x, Nessie writes a new Iceberg metadata file and records a commit on that branch—the pointer on main stays unchanged until you merge.
  • Commits are metadata-only. An INSERT writes a new Iceberg metadata file, then Nessie moves the branch pointer to that file in one optimistic-concurrency step.
  • Back-ends are pluggable. For local trials, you can run the default in-memory or JDBC store; for prod, you might choose Postgres (as in the Docker compose above) or RocksDB/Cassandra for distributed scale
  • Standard APIs. Because Nessie now speaks the Iceberg REST protocol, any engine or library that implements that spec—Spark 3.5, Flink 1.19, Trino 448, PyIceberg, even DuckDB, can point at http://<nessie-server-address>:19120/api/v2 and work.

What are Nessie’s Key Features?

Feature How Does this Benefit You?
Git-like branches/tags CREATE BRANCH (SQL or API) forks the current state; MERGE BRANCH brings changes back. Gives catalog-level version control and time travel.
Cross-table commits A single commit ID can update many Iceberg tables at once: no half-written state, perfect for multi-table ETL.
Zero-copy isolation Each branch is a consistent view; the main branch never sees partial changes until you merge
Multi-engine support Works with e6data, Spark, Flink, Trino, Dremio, etc., via Iceberg REST or native plugins
Auth plugins (OAuth2/OIDC) Drop Nessie behind your IdP / API gateway for RBAC and audit logging.
CLI & Python SDK pip install pynessie gives a command-line client and Python automation.
Mature OSS community Active since 2020, production-hardened in Dremio Arctic and elsewhere

Enough of theory. Let’s install it in our local system and play with it to understand more.

Hands-on with Project Nessie

Now that you’ve read about Nessie and what it can do, let’s get hands-on and get a real-world run with Nessie.

docker-compose.yaml

version: "3.9"
volumes:
  warehouse:
  nessie-db:
networks:
  demo: {}
services:
  # 1️⃣  Postgres 14 – Nessie commit log
  nessie-db:
    image: postgres:14
    environment:
      POSTGRES_DB: nessie
      POSTGRES_USER: nessie
      POSTGRES_PASSWORD: nessie
    volumes: [ nessie-db:/var/lib/postgresql/data ]
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "nessie"]
      interval: 10s
      retries: 5
    networks: [ demo ]
  # 2️⃣  Nessie server
  nessie:
    image: ghcr.io/projectnessie/nessie:0.104.1
    # platform: linux/amd64   # ← uncomment on Apple-silicon
    depends_on:
      nessie-db:
        condition: service_healthy
    environment:
      QUARKUS_HTTP_PORT: 19120
      NESSIE_VERSION_STORE_TYPE: JDBC2
      QUARKUS_DATASOURCE_JDBC_URL: jdbc:postgresql://nessie-db:5432/nessie
      QUARKUS_DATASOURCE_USERNAME: nessie
      QUARKUS_DATASOURCE_PASSWORD: nessie
    ports: [ "19120:19120" ]
    networks: [ demo ]
  # 3️⃣  Spark 3.5 + Iceberg 1.9 + Jupyter
  spark:
    image: tabulario/spark-iceberg
    depends_on: [ nessie ]
    ports:
      - "8888:8888"   # Jupyter
      - "4040:4040"   # Spark UI
    volumes:
      - warehouse:/opt/warehouse
      - ./conf/spark-defaults.conf:/opt/spark/conf/spark-defaults.conf:ro
      - ./notebooks:/home/iceberg/notebooks:rw           # ← notebooks persist
    environment:
      SPARK_DRIVER_MEMORY: 1G
    networks: [ demo ]

Run this code for a hands-on application of what we’ve done so far. Don’t forget to refer to my comments in the code.

# ── Boot a Spark-SQL session that already knows the Iceberg + Nessie extensions
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# ── 1. Create a demo namespace + table on the *main* branch
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo")

spark.sql("""
    CREATE TABLE demo.sales (
        id  INT,
        amt DOUBLE
    ) USING iceberg
""")                                    # ⇒ metadata-00000.json

# insert one row on main → commit A
spark.sql("INSERT INTO demo.sales VALUES (1, 9.99)")
spark.sql("SELECT * FROM demo.sales").show()   # 1 row

# ── 2. Fork the catalog state
spark.sql("CREATE BRANCH dev IN nessie")       # pointer dev → commit A
spark.sql("USE REFERENCE dev IN nessie")       # work on the branch

# add a second row on dev → commit B
spark.sql("INSERT INTO demo.sales VALUES (2, 19.95)")
spark.sql("SELECT * FROM demo.sales").show()   # 2 rows (dev only)

# ── 3. Switch back to production and verify isolation
spark.sql("USE REFERENCE main IN nessie")
spark.sql("SELECT * FROM demo.sales").show()   # still 1 row

# ── 4. Merge the dev work into main
spark.sql("MERGE BRANCH dev INTO main IN nessie")  # fast-forward main → B
spark.sql("REFRESH TABLE demo.sales")              # force Spark to reload metadata
spark.sql("SELECT * FROM demo.sales").show()       # main now shows 2 rows

Apache Gravitino

Apache Gravitino (incubating) takes an amplified view of the catalog problem. It is described as a “high-performance, geo-distributed, and federated metadata lake.” In simpler terms, Gravitino is a unified metadata and governance platform that acts as an Iceberg table catalog and federates metadata from many different sources and regions.

It was developed (and open-sourced) by the company Datastrato, and it aims to let you manage metadata and govern data across all your data sources, including cloud storage systems (data lakes), relational databases, streaming systems, and more, under one umbrella. This means Gravitino isn’t limited to just Iceberg or just one kind of data; it’s built to be a single source of truth for metadata in a large, heterogeneous data environment.

Apache Gravitino layers a pluggable metadata lake on top of an Iceberg catalog, then synchronizes and mirrors that metadata in real time across every source, region, and cloud.

How Does Apache Gravitino Work?

  • Provides a technical data catalog and metadata lake that plug into multiple underlying metadata stores through extensible connectors.
  • Can pull in or directly manage metadata from sources such as Hive Metastore, relational database catalogs, and file systems, exposing everything through a unified API.
  • Manages these systems directly, so updates in Gravitino or in an underlying source are reflected immediately in the other, eliminating lag or mismatches common in periodic-sync catalogs.
  • Features a geo-distributed architecture: instances deployed in different regions or clouds coordinate to share metadata, giving users a global, continuously synchronized catalog across data centers.
  • Builds on Apache Iceberg’s table format and REST catalog concepts, with plans to extend support to other formats like Hudi and Delta Lake.
  • Strives to act as a universal metadata hub for varied table formats and AI/ML assets.

What are Gravitino's Key Features?

Feature How Does this Benefit You?
Unified Metadata Management Provides a unified relational-table model (Hive, MySQL, PostgreSQL) and file-based model for unstructured data (HDFS, S3), abstracting Hive Metastore, object stores, streaming catalogs under one API.
End-to-End Data Governance Centralized access control, auditing, and data discovery; policies enforced across Spark, Trino, etc., ensuring uniform compliance for warehouses, lakes, and AI datasets.
Federated and Geo-Distributed Architecture Connector framework spans multiple back-ends (e.g., Hive Metastore, Snowflake) and regions; instances sync to maintain a global namespace while serving low-latency local queries.
High-Performance Metadata Store Built-in key-value cache and optimized commit logic deliver low-latency, high-concurrency metadata operations, overcoming Hive Metastore’s slowness.
Pluggable & Extensible Modules Modular extensions for authentication (OAuth2, LDAP), authorization, metrics, event listeners; supports custom monitoring and lineage triggers.
Multi-Engine & Multi-Format Support Iceberg REST catalog already integrates with Trino; Spark, Flink, Hudi, Delta Lake, and AI/ML asset cataloging on the roadmap for a single multimodal catalog.

Apache Polaris

Apache Polaris (incubating) is an open-source, fully-featured catalog for Apache Iceberg tables. It was originally developed by Snowflake and contributed to Apache, with the goal of being a vendor-neutral Iceberg catalog. Polaris implements Iceberg’s standard REST API for interoperability across many engines (e6data, Spark, Flink, Trino, Dremio, StarRocks, Apache Doris, etc.) and allows them all to create, update, and manage Iceberg tables through a single service.

Essentially, Polaris decouples table metadata from any one compute platform. You can run Polaris as a centralized Iceberg catalog on your own infrastructure (or hosted in Snowflake’s cloud) and have different analytics engines use it concurrently without each needing its own metastore. The incredible bit is that this avoids siloed metadata and duplication of data for different tools.

Apache Polaris is a self-hosted, centralized Iceberg catalog that exposes Iceberg’s standard REST API, allowing engines such as e6data, Snowflake, Databricks, and Trino to read and write the same tables simultaneously through a shared metastore.

How Does Polaris Work?

  • Acts as a RESTful server that tracks Iceberg table metadata and schema definitions.
  • Client engines (e.g., Spark, Trino) interact with Polaris through REST API calls to read or write Iceberg tables.
  • Applies metadata changes atomically (such as committing new snapshots) ensuring ACID guarantees for table operations.
  • Designed for vendor neutrality: data can reside on any cloud object storage (S3, GCS, Azure Blob, etc.), with an abstraction layer that hides metadata location and format from engines.
  • Supports namespaces and catalogs to organize tables logically; a single deployment can host multiple catalogs.
  • Operates in two table modes:
    • Internal tables: fully managed by Polaris with full read/write access.
    • External tables: managed by another catalog (e.g., Hive, Glue); Polaris treats them as read-only mirrors.
  • Federation capability lets you gradually migrate from other metastores (e.g., start with Glue or Snowflake in read-only mode, then shift management to Polaris).
  • Enables hybrid deployments and avoids forcing an all-or-nothing migration path.

What are Apache Polaris’ Key Features?

Feature How Does this Benefit You?
Iceberg REST API & Multi-Engine Support Implements the Iceberg REST catalog spec; any Iceberg-aware engine connects via the REST endpoint. All reads/writes flow through Polaris, giving ACID transactions and consistent views. Multiple engines can simultaneously query and update the same Iceberg tables while always seeing a consistent state.
Built-in Governance & RBAC Snowflake-inspired governance with role-based access control at catalog, namespace, and table levels. Central policy enforcement means no per-tool duplication. Credential vending issues temporary cloud-storage credentials so engines can access only authorized data. This means no long-lived keys are required.
Vendor-Neutral & Flexible Deployment Polaris is a community-driven project (contributors include Stripe, IBM, Uber, and Starburst). Runs on any infrastructure - on-prem, any cloud, Kubernetes, and any cloud storage. No dependency on Snowflake’s platform, preventing vendor lock-in and keeping metadata portable.
Catalog Federation (Hybrid Mode) Registers external catalogs (Hive Metastore, AWS Glue, etc.) as read-only sources while Polaris-managed catalogs remain fully transactional. Enables gradual migration to a unified lakehouse without disrupting existing systems.
Robust Cloud Storage Support Configurable storage backends for S3, Azure Blob, Google Cloud Storage, and more. Catalog tables across clouds; Polaris handles differing APIs and security. Credential vending plus unified governance secures multi-cloud access for all engines.

Lakekeeper

Lakekeeper is another open-source Iceberg catalog but with a different philosophy. It emphasizes simplicity, performance, and security. It is described as “a secure, fast, and user-friendly Apache Iceberg REST Catalog built with Rust” and aims to be a lean and efficient catalog service that is easy to deploy and integrate.

Lakekeeper was started by independent contributors (with a growing community around it) and is released under the Apache License. While Apache Polaris and Gravitino are large-scale projects with backing from big companies, Lakekeeper represents a more lightweight approach: it’s written in Rust (for performance and safety) and distributes as a single binary you can run anywhere (no heavy dependencies like a JVM or big database required).

Despite its lighter footprint, Lakekeeper fully implements the standard Iceberg REST API, meaning it can be used by any engine that speaks that API, similar to Polaris and Gravitino.

Iceberg Catalog + Rust and the Iceberg REST API flow into Lakekeeper, which ships as a single Rust binary and can be used by any Iceberg-compliant engine.

How Does Lakekeeper Work?

  • Architecture mirrors other Iceberg REST catalogs, featuring a server component that clients register with.
  • Spark or Flink jobs invoke Lakekeeper REST endpoints to create or update Iceberg tables; Lakekeeper manages commit logic and metadata storage.
  • Internally, metadata is stored in a relational (normalized) database—either an embedded store or an external SQL database.
  • This design lets Lakekeeper query and update metadata efficiently and add capabilities (e.g., metadata analytics) without scanning object-storage metadata files.
  • Compared with a pure “on-file” catalog like Hadoop Catalog, Lakekeeper can more easily provide global statistics, fast lookups, and multi-table transactions.
  • Server-side commits run inside a single database transaction, enabling true atomic commits that span multiple tables—something traditional metastore catalogs (e.g., Hive) lack.
  • Distributed as one compiled binary, so it deploys smoothly on a laptop for local tests or in Kubernetes for production.
  • Includes a web UI and a Helm chart for Kubernetes, simplifying initial setup.
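As a quick illustration of that flow, here is a minimal PyIceberg sketch against a locally running Lakekeeper instance. The endpoint path, warehouse name, and the absence of authentication are assumptions for a local test setup, not guaranteed defaults; check your deployment's configuration for the real values:

```python
# Minimal sketch: register and query an Iceberg table through a local
# Lakekeeper instance using PyIceberg's REST catalog client.
# Connection values below are assumptions for a local test deployment.
import pyarrow as pa
from pyiceberg.catalog.rest import RestCatalog

catalog = RestCatalog(
    name="lakekeeper",
    uri="http://localhost:8181/catalog",  # assumed Lakekeeper REST endpoint
    warehouse="demo-warehouse",           # assumed warehouse registered in Lakekeeper
)

# Namespaces and tables are created through the REST API; Lakekeeper records
# each commit in its relational metadata backend.
catalog.create_namespace(("demo",))

rows = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
table = catalog.create_table("demo.users", schema=rows.schema)
table.append(rows)

print(table.scan().to_arrow().num_rows)  # 3
```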

What are Lakekeeper's Key Features?

  • High-performance, Single-Binary Server: Fast startup, a low memory footprint, effortless containerization, and horizontal scalability, thanks to a Rust codebase (no JVM), stateless compute nodes, a relational metadata model, a "no unsafe code" policy, a Helm chart, and a built-in UI.
  • Complete Iceberg REST Compatibility: A drop-in catalog for Spark, Flink, Trino, e6data, and others, with multi-table atomic commits and time travel. The full Apache Iceberg REST API and cross-table transactions remove the limitations of Hive Metastore and AWS Glue.
  • Fine-grained Access Control and Governance: Centralized, consistent permissions across all query engines, backed by OAuth2/OIDC authentication, an Open Policy Agent bridge, an OpenFGA backend, a check API, an audit log, and change-event hooks.
  • Secure Data Access and Storage Management: Least-privilege object-store access without sharing long-lived keys, via vended credentials, signed URLs for S3, and short-lived, scope-limited tokens.
  • Automation and Maintenance: Hands-off optimization and lifecycle management for Iceberg tables, including planned background compaction and snapshot expiration, an ecosystem-first Rust SDK/CLI, and rapid feature adoption.

Databricks Unity Catalog

Databricks Unity Catalog is a bit different from the other catalogs discussed, as it originated as a proprietary offering from Databricks rather than an Apache open-source project. Unity Catalog is essentially Databricks’ unified governance solution for data and AI assets across the lakehouse.

Unity Catalog was introduced to provide Databricks customers with a single interface to manage all their tables, files, and machine learning assets (models, etc.) with fine-grained governance. The big news, however, is that in mid-2024, Databricks open-sourced Unity Catalog under the Apache 2.0 license, aiming to make it the industry’s first truly open universal catalog for data and AI.

The project is now hosted under the Linux Foundation (LF AI & Data) rather than Apache, but it’s developed in the open with an API-first approach. Unity Catalog’s importance lies in bridging the gap between traditional data governance (for tables/files) and AI governance (for ML models and more), all while supporting multiple data formats and execution engines.

How Does Unity Catalog Work?

Unity Catalog supplies a central metadata and permission store unified along three dimensions: multi-format, multi-engine, and multimodal.

  • Multi-format
    • Catalogs data in Delta Lake (Databricks’ native table format) as well as Apache Iceberg, Parquet, CSV, and others.
    • Introduces UniForm, which presents Delta tables in an Iceberg-compatible way so Iceberg clients can read Delta files, achieving interoperability between formerly competing standards.
  • Multi-engine
    • Exposes open interfaces such as a Hive Metastore-compatible API and the Iceberg REST API, enabling many processing engines to use Unity Catalog as their catalog.
    • Allows tools like Trino clusters or PyIceberg notebooks to query data governed by Unity Catalog, not just Databricks Spark clusters (a minimal sketch follows after this list).
    • Targets becoming a universal data-access layer instead of being tied only to Databricks runtimes.
  • Multimodal
    • Governs traditional tabular data, unstructured files, machine-learning models, ML experiment tracking, and even AI “functions” or tools as first-class entities.
    • Provides a single namespace where all these asset types can be registered side by side and have access controlled consistently.
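To make the multi-engine point concrete, here is a minimal, hypothetical PyIceberg sketch that reads a Unity Catalog-governed Iceberg table from outside Databricks. The REST endpoint path, catalog name, table identifier, and token handling are assumptions, so consult the Unity Catalog documentation for your workspace's exact URI and authentication flow:

```python
# Hypothetical sketch: query a Unity Catalog-governed Iceberg table via the
# Iceberg REST interface. Endpoint path, catalog name, table name, and token
# are placeholders for illustration only.
from pyiceberg.catalog.rest import RestCatalog

catalog = RestCatalog(
    name="unity",
    uri="https://<workspace-host>/api/2.1/unity-catalog/iceberg",  # assumed REST path
    warehouse="main",                     # the Unity Catalog catalog name (assumed)
    token="<personal-access-token>",      # bearer token for the workspace (assumed)
)

# Tables are addressed as <schema>.<table> inside the chosen catalog.
table = catalog.load_table("sales.orders")
print(table.schema())

# Scans and time travel behave like any other Iceberg REST catalog.
print(table.scan(limit=10).to_arrow())
```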

What are the Key Features of Unity Catalog?

  • Unified Governance Across Data & AI: One catalog to set fine-grained ACLs, audit every access, and trace lineage for tables, files, features, and models, so security and impact analysis stay consistent across every engine (SQL, REST, etc.).
  • Open Interfaces (Hive Metastore & Iceberg REST): Speaks the standard HMS and Iceberg REST APIs, letting Spark, Flink, Trino, Presto, and other tools plug in with zero custom glue. This avoids vendor lock-in while mapping Delta ↔ Iceberg metadata behind the scenes.
  • Multi-Format Data Support: Registers Delta, Iceberg, and Parquet/CSV/JSON tables side by side, so you can migrate lakes or run mixed workloads without forced conversions; Unity Catalog stores unified metadata and exposes each format natively.
  • Credential Management & Secure Access: Centralizes storage authentication with credential vending/passthrough and IAM federation; engines get least-privilege temporary credentials or pre-signed URLs, keeping data-plane access bounded by catalog policy.
  • Multimodal Asset Management: Treats MLflow models, feature tables, and LLM tools as first-class assets next to data, capturing full lineage (raw → table → feature → model) and enforcing role-based publish/consume controls.
  • Ecosystem & Collaboration: Backed by AWS, Azure, GCP, Nvidia, Salesforce, and others; integrates with Delta Sharing for secure cross-org reads and can run standalone, positioning Unity Catalog as the open standard governance layer.

In Conclusion

I reiterate that the catalog war is still alive. Over the course of this blog, we traced how Iceberg's classic catalogs (Hadoop, Hive, and Glue) provide the essential mechanism of an atomic pointer swap while leaving gaps around branching, cross-table consistency, and governance.

Modern catalogs such as Nessie, Polaris, Gravitino, Lakekeeper, and Unity Catalog step in to close those gaps and add the features data engineers need. They speak the Iceberg REST dialect while layering on Git-style version control, federated views, fine-grained policies, and multi-engine freedom.

Together, these emerging catalogs prove that a catalog is no longer just a phone book but a contract that lets every engine collaborate confidently on the same lake, with safeguards baked in and silos designed out. Whether you prize immutable histories, geo-scalable metadata, or a lean Rust binary you can run anywhere, the toolbox now exists, and the next move is yours. Choose boldly, experiment early, and let your lakehouse flourish!

FAQs

What is an Apache Iceberg catalog, and why do I need one?
It’s the external metadata store that tracks every Iceberg table’s schema, partitioning, and current snapshot. The catalog keeps a single pointer to the latest `metadata.json`, letting any engine (Spark, Trino, Flink, etc.) read or write the table with ACID guarantees and time-travel. Without it, the lake is just files.
Why are traditional catalogs insufficient for modern lakehouses?
Lakehouses now demand Git-style branching, multi-table atomic commits, multi-cloud deployments, and fine-grained RBAC. Classic catalogs weren’t built for these needs, so teams face data-loss risk, slow directory scans, S3 rename quirks, and limited transaction scope—prompting the rise of newer REST catalogs that fill those gaps.
Which emerging catalog should I choose for cross-table transactions?
If you need guaranteed cross-table atomicity, choose catalogs designed for it—Project Nessie and Lakekeeper both wrap multiple table updates in a single commit. Traditional Hadoop, Hive Metastore, and AWS Glue can only commit one table at a time, risking partial state when ETL touches several tables.
What unique features does Project Nessie add to Iceberg cataloging?
Project Nessie layers Git-like branches and tags over Iceberg. Every change is stored as a commit in a pluggable backend, and a single commit can touch many tables. That lets you experiment on an isolated branch, merge when ready, and guarantees cross-table atomicity missing from HMS or Glue.
