ZODB is a powerful native object database for Python, widely known for its use in the Zope web framework and the Plone content management system. By enabling transparent object graph persistence with no need to predefine schemas, ZODB enables extremely flexible application development. With pluggable storage engines such as FileStorage, ZEO, and RelStorage, it also provides flexible ways to store data.

This post provides an introduction to ZODB, focusing on some of the lower-level mechanics of storing data. This post doesn't discuss persistent objects.

Disclaimer

This was written in support of the RelStorage 3.0 release so it may be biased in that direction. It is not an exhaustive list of all storage options. For example, it doesn't discuss NEO, a distributed, redundant storage. Partial lists of included and non-included storages may be found in the ZODB storage documentation.

What is ZODB?

ZODB [1] is a native object database for Python, enabling transparent object persistence. It provides the illusion of an infinite memory space holding application-defined objects. That memory space is shared between processes running at different times on the same or different machine. Only those objects actually used are brought into physical memory. Think of it something like operating system paging, but for objects, and distributed across time and space. (Apple's CoreData framework has a similar technique it calls "faulting".)

In addition, ZODB provides a transactional view of these objects with snapshot isolation. Any given connection to the database sees a consistent view of all the objects in the database (whether it reads or writes to any particular object or not) as-of the moment it began. When adding or updating objects, no changes are published and made visible to other connections until the writing connection commits its transaction, at which point either all the changes are made visible or none of them are. Existing connections that continue reading (or even writing!) will still not see those changes; they're "stuck" at the snapshot view of the objects they started with. (The ability for readers to continue to be able to retrieve old data that's been replaced in newer transactions is known as multi-version concurrency control, or MVCC.)

Many connections may be reading and writing to the database at once. ZODB uses optimistic concurrency control. Readers don't block other readers or writers, and writers are allowed to proceed as if they were the only one making changes right up until they commit. Writes are defined to occur in a strict order. If a writer discovers that an earlier transaction had modified objects that it too wants to modify, a conflict occurs. Instead of just rolling back the writing transaction and forcing it to start over, taking the modified object into account, ZODB gives the application the chance to resolve the conflict using a three-way merge between the object as it existed when the transaction began, the object that the connection wants to commit, and the object that was committed by the other writer. Only if it cannot do so is the transaction rolled back.

What is a ZODB storage?

ZODB uses a pluggable storage architecture, allowing different ways to store the objects it manages. Storage engines are responsible for allocating persistent object identifiers (OIDs) for each object ZODB manages, storing object state data [2] when an object is added or changed [3], and later retrieving the data for that particular object given its OID. The storage is also responsible for implementing snapshot isolation, ordering (serializing) writes and assigning incrementing transaction identifiers (TIDs), and detecting and handling conflicting writes.

FileStorage

Out of the box, in addition to a few different transient (in-memory) storage engines, ZODB comes with one persistent (on-disk) storage engine. FileStorage uses a single file to store an append-only transaction log for all the objects in the database. An additional in-memory and on-disk structure is used to record the relationship between objects (OIDs) and the transactions (TIDs) they appear in.

As an append-only file, writing to FileStorage can be quite fast. It requires memory (and extra storage space) proportional to the size of the database to record object positions for fast access. If that extra index data isn't saved to disk, it requires time proportional to the size of the database to scan the file on startup to re-create that index.

Because of its append-only nature, previous versions of objects are still found in the file and can be accessed by providing a proper TID. A FileStorage is thus said to be "history preserving." That's how snapshot isolation is implemented: each connection is explicitly associated with a TID and when it needs to read an object it asks the FileStorage to provide the revision of the object most recently written before that TID [4]. This can also be used like a version control system to view and even recover or undo changes to objects. Periodically, a FileStorage must be "packed" to remove obsolete historical data and prevent the file from growing forever.

FileStorage is widely deployed and has a long history of stability. It can only be used by a single process at a time, however. Within that process, only a single thread can be in the process of committing a transaction at a time (FileStorage uses a database-wide lock to provide serialization).

ZEO

A common method to extend access to a FileStorage to more than one process and/or to more than one machine is to deploy a ZEO [5] server. ZEO uses a client/server architecture. The server process opens one or more storages (in practice, always a FileStorage [6]) and exposes a network API to provide access to this storage. Client processes connect to this server and send it read and write requests. The server mediates access to the underlying storage for the clients.

ZEO inherits many of the strengths and weaknesses of its underlying storage and adds some of its own. For example, clients can be configured with a persistent local cache for cheap access to common objects or even read-only access when the server isn't available. But the central ZEO process has to contend with Python's GIL, which may limit scalability, and it defaults to resolving conflicts by loading application code into the server process, which can complicate deployments due to the need to keep client and server processes all running compatible code.

ZRS [7] is a storage wrapper implemented in Python and commonly wrapped around a ZEO storage that provides replication of data.

What is RelStorage?

RelStorage is a ZODB storage engine that's meant to solve many of the same problems as ZEO and ZRS, but taking a different approach with a different set of tradeoffs. RelStorage uses a relational database—MySQL, PostgreSQL, Oracle, or SQLite—to provide the final storage for object state data. It pushes the responsibility for OID allocation, locks, transaction management and snapshot isolation, and replication down to these systems.

The next section is mostly a copy of RelStorage's own description of its features. It makes references to ZEO and FileStorage described above.

Features

  • It is a drop-in replacement for FileStorage and ZEO, with several enhancements:
    • Supports undo, packing, and object history preservation just like FileStorage.
    • RelStorage can be configured not to keep object histories for reduced disk space usage and improved performance.
    • Multiple processes on a single machine can read and write a local ZODB database using SQLite without needing to start and manage another process (i.e., ZEO).
    • Blobs can be stored on a shared filesystem, or (recommended) in the relational database and only cached locally.
    • Multiple threads in the same process share a high-performance in-memory pickle cache to reduce the number of queries to the RDBMS. This is similar to ZEO, and the ZEO cache trace tools are supported.
    • The in-memory pickle cache can be saved to disk and read when a process starts up. This can dramatically speed up site warmup time by eliminating a flood of RDBMS queries. Unlike ZEO, this cache is automatically shared by all processes on the machine (no need to configure separate client identifiers.)
  • Ideal for large, high volume sites.
    • Multiple Python processes on multiple machines can read and write the same ZODB database concurrently. This is similar to ZEO, but RelStorage does not require ZEO.
    • Supports ZODB 5's parallel commit feature: Database writers only block each other when they would conflict (except for a small window at the end of the twophase commit protocol when the transaction ID is allocated; that still requires a global database lock).
    • According to some tests, RelStorage handles concurrency better than the standard combination of ZEO and FileStorage.
    • Whereas FileStorage takes longer to start as the database grows due to an in-memory index of all objects, RelStorage starts quickly regardless of database size.
    • Capable of failover to replicated SQL databases.
  • Tested integration with gevent for PostgreSQL and MySQL.
  • There is a simple way (zodbconvert) to (incrementally) convert FileStorage to RelStorage and back again. You can also convert a RelStorage instance to a different relational database. This is a general tool that can be used to convert between any two ZODB storage implementations.
  • There is a simple way (zodbpack) to pack databases.
  • Supports zodburi .
  • Free, open source (ZPL 2.1)

MVCC and History Free Storage

One thing in particular I'd like to highlight is that RelStorage can implement snapshot isolation and conflict resolution without preserving history. To do this, it relies on the RDBMS's native implementation of MVCC, the repeatable read isolation level, and the read committed isolation level.

When a transaction begins, a RDBMS transaction is opened on a connection at the repeatable read (or higher) level. This connection is used for loading data from the database. This isolation level causes the RDBMS to establish its own snapshot view of the database as-of that moment of time.

A second connection is used to write data to the database. This connection is in the lower isolation level of simply read committed. This level ensures that each query it makes to the database returns the latest committed data. Objects being written are first placed in a temporary table; they are moved to their final table (overwriting an old revision for history free storages) only after any possible conflicts have been found and resolved.

The difference in the two connections' isolation levels matters specifically because of conflict resolution, as does the use of a temporary table. Recall that resolving conflicts needs three versions of the object: the object that existed when the transaction began (the original object), the object that is currently committed and was changed by someone else (the committed object), and the object that the writer would like to store (the new object). The task of the conflict resolution is to find the delta between the original object and the new object and apply those same changes to the committed object. This produces a new object to store which will become the committed object.

Ignoring caches, the only place that original object can come from is that load connection at repeatable read isolation level. By definition, any fresh connection or transaction that looked at the database now would see the currently committed object (or something even later)—the original object has been overwritten and that change committed, so it's gone. RelStorage relies on the underlying database to keep it visible to the load connection.

Likewise, getting the currently committed object requires a connection that can read the current state of the database. That's where the second connection comes in. It can see the current data in the database.

Q & A

Why two connections? Why not put the data in the temporary table, commit, and begin a new transaction to update the current view of the database?

Because that would lose access to the original object.

Why not preemptively store off all the original objects somewhere (e.g., download them or copy them to a temp table) before committing?

Because ZODB uses an optimistic concurrency model. We assume that conflicts are few and far between. If that's true, that would be doing a bunch of extra work that we don't usually need to do. Remember, there's no way to know if there's going to be a conflict or not without a current view of the database.

Well then, why not just have a single shared connection for the current view of the database and use it to check for conflicts and only then save the original objects that have conflicts?

Because that connection wouldn't know what objects to check for conflicts on. Those objects are already in the database in temporary tables that are connection specific and unreadable to a different connection. We'd have to pass a list of object IDs back to the database, and not all databases support array operations to do that efficiently. Or we'd have to write to a persistent table, which doesn't sound appealing (we'd have to arrange to delete from it too.)

Also, because RDBMS connections aren't thread-safe, that would introduce a per-process lock into the commit process.

Still, perhaps that's worth looking into more.

Couldn't a history-preserving database implement snapshot isolation just like FileStorage and use only one read committed connection?

Quite possibly, yes. That could make for some moderately ugly or inefficient SQL queries though.

SELECT *
FROM object_state
WHERE zoid = :zoid
AND tid <= :tid
ORDER BY tid DESC
LIMIT 1

Why temp tables? Why not store directly to the final table?

For history free databases, the final table is where we get the data to resolve conflicts, so we can't overwrite it.

For history preserving databases, we don't yet have the necessary transaction ID we need to store to the final table. (The primary key is (OID, TID), and the TID is a foreign key reference to another tables as well).

We could allocate the TID earlier, before storing temporary data, but that defeats much of the benefit of ZODB 5 parallel commit.

We could use a fake TID and update it in-place, but altering primary keys tends to be expensive.

Conflict Resolution

RelStorage supports conflict resolution. Conflict resolution is performed in each individual process in a distributed fashion. There's no central server that has to be updated with application code in order to resolve conflicts. ZEO 5 supports a similar feature.

Summary

ZODB is a flexible and powerful object database for Python, supporting transactions, optimistic concurrency, and conflict resolution. It uses a layered architecture with the definition and serialization of individual objects handled by the persistent library, the generic transactional API provided by transaction, and data storage and MVCC semantics provided by the pluggable storage layer.

ZODB comes with a storage implementation using an append-only file, as well as an in-memory dict-based storage plus a change-tracking demo storage. These are all restricted to a single process, but ZEO allows utilizing them from multiple processes.

RelStorage is a storage layer based on a SQL database, intended to be highly scalable.

Updates

  • Add additional links to more resources about included and third-party storages.

Footnotes

[1] ZODB may stand for "Zope Object Database," or it may stand for "Z Object Database."
[2] To the storage engine, the object state data is just an opaque sequence of bytes. In reality, the ZODB Connection uses Python's standard pickle protocol to serialize objects into bytes.
[3] Storages also handle non-object data in the form of BLOBs, each of which is associated with an object and assigned an OID.
[4] Actually, the layer that implements snapshot isolation on top of an arbitrary history preserving storage is found in the core of ZODB. This was one of the major changes in ZODB 5.
[5] Previously "Zope Enterprise Objects".
[6] Though one possibly wrapped in something like zlibstorage to provide compression or cipher.encryptingstorage to provide encryption.
[7] "ZODB Replicated Storage"