I’m a huge fan of databases, so much so that I’ve written a book on so-called “NoSQL” databases, I spent some of my most fruitful years in tech working on the highly influential distributed database Riak, and I even built a database called Purple last year just for fun.
Naturally, I’m always on the lookout for new and exciting developments in databases and DB-related tools when I scan (trash fires like) Twitter, Reddit, and HackerNews. In this post I’d like to talk about three recently minted database technologies that I find intriguing:
In part 2 I’ll cover three others:
In part 3 I’ll conclude with some closing thoughts. Note: I’ll be focusing exclusively on the core technologies and mostly ignoring things like enterprise features (where applicable).
My selection criteria are purely subjective. If there’s something you don’t see here that you think I should be checking out, tweet at me and let me know! My handle is @lucperkins.
TileDB is a DB built around multi-dimensional arrays that enables you to easily work with types that aren’t a great fit for existing RDBMS systems, such as dense and sparse arrays and dataframes. TileDB is specifically geared toward use cases like genomics and geospatial data.
- Swappable data backends with support for Amazon S3, MinIO, and others
- Storage integrations for HDFS, PrestoDB, Apache Spark, Dask, and others
- Language bindings for C/C++, Python, R, Java, and Go
What I like most
I’m a fan of “specialty” DBs like this that hone in on a specific set of data types and problems. The great thing about traditional RDBMSes is that they’re versatile enough to cover an extremely wide array of use cases (no pun intended) but sometimes you have “last mile” edge cases that are both (a) beyond the capabilities of “kitchen sink” systems and also (b) at the core of your business.
I expect to see the emergence of more systems like this as database use cases become ever more specialized and new problem domains emerge. The old guard RDBMSes aren’t going anywhere, of course, but it’s nonetheless encouraging to see TileDB and others pushing the envelope. What I’m really hoping for is the emergence of extremely “hackable,” resolutely non-monolithic DBs that provide a plugin interface for highly use-case-specific data types, but this is something I’ll talk about later.
Questions for the project
- How much work is done on the client library side versus the database side? Are the TileDB clients essentially language-specific math libraries for manipulating complex data types locally and occasionally saving the results to the desired backend or are they like other DB client libraries that mostly just relay commands to the database? It’s not entirely clear from the documentation.
- What’s the rationale for providing a key-value store given the plethora of existing K/V options? The docs even say that “TileDB is not designed to work as a special-purpose key-value store.” What’s the value of ratcheting on a feature like this?
Materialize touts itself on its website as “the first true SQL streaming database” and that actually may not be overblown! It’s essentially a relational database that’s wire compatible with PostgreSQL but with the crucial difference that it offers materialized views that are updated in real time.
I’ve also seen Materialize described as a streaming data warehouse, which seems fitting.
In standard Postgres, for example, you have to manually update materialized views:
CREATE MATERIALIZED VIEW my_view (/* ... */); REFRESH MATERIALIZED VIEW my_view; /* The underlying tables change */ REFRESH MATERIALIZED VIEW my_view; /* More stuff happens */ REFRESH MATERIALIZED VIEW my_view;
You can do this as frequently as you’d like, perhaps using a script or a cron job. What I have yet to see but always secretly wanted, however, is a database that natively supports incremental updates to materialized views. Yep, that’s right: Materialize listens for changes in the data sources that you specify and updates your views as those sources change.
Even if Materialize doesn’t “win” or stick around for very long, I suspect that the capabilities that it brings to the table are here to stay and will almost certainly be reproduced in DBs down the road.
- A diverse range of data sources, including other tables (as in standard Postgres), JSON, CSV, and other files, Kafka and Kinesis topics, and most likely many others in the future.
- The core engine is powered by two really powerful constructs: timely dataflow and differential dataflow. I won’t go over these here but I strongly recommend delving into those concepts yourself. The very academic-leaning creators of Materialize have been deeply involved with both projects, so you can trust that Materialize is the product not of a “hacker ethos” but rather an extremely careful, fastidious process.
- Because Materialize is wire compatible with Postgres, you can still use psql and all the other Postgres tools you’ve become accustomed to.
What I like most
Materialize has the potential to replace a lot of things. Most obviously, it lets you full-on jettison any existing processes you have for incrementally updating your materialized views. That’s a moderately sized win.
The much bigger win for me, however, is that Materialize lets you decommission those portions of your data stack devoted to listening for changes on your data sources. You can do things like this natively:
CREATE SOURCE click_events FROM KAFKA BROKER 'localhost:9092' TOPIC 'click_events' FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://localhost:8081';
Your DB is now “aware” of a data source that it can use to construct auto-updating materialized views. This native “piping” feels even more magical to me than auto-updating views. If you’re running, say, serverless functions or Heron jobs or Flink pipelines that do nothing more than listen here and write an
INSERT statement there, Materialize will enable you to unceremoniously cut out that part of your stack.
Questions for the project
- Why a separate DB and not a Postgres extension? I’m sure there are good architectural reasons for why an extension wouldn’t cut it here but I’d love to know what that is.
- How easy is it to build extensions for other data sources? How could I write, for example, an extension for Apache Pulsar? Do you plan on exposing an API for this to developers?
The Golden Path for using Prisma goes something like this:
- Define your application-level data types using Prisma’s schema SDL
- Generate highly idiomatic code for your language of choice out of your created schema
- Get busy creating REST APIs, GraphQL APIs, and whatever else you want to build
Overall, I do understand the hesitance that some might have toward something like Prisma. A large subset of devs don’t want the database abstracted away. They don’t want clever DSLs and GUIs. They want to write plain SQL and they want to handcraft all their DB interaction code. I completely understand the desire to retain that degree of control, but I would nonetheless encourage you to spend 20-30 minutes giving Prisma a try. I think it does a very good job of carefully sailing between the Scylla of too much sugar and the Charybdis of raw SQL.
- Prisma Client enables you to define your desired data types using a specialized schema SDL that generates code for you. Even the connection information for your database disappears from your application code and into the generated code.
- Prisma Migrate enables you to define DB migrations declaratively (rather than imperatively as with SQL). The messy details of how the DB gets from state
Bare hidden from view.
- Prisma Studio is a visual editor that essentially provides a GUI over the other Prisma tools.
What I like most
The Prisma schema DSL is a way of defining not only the data types you need for your application but also which code generator you want to use and connection information for the DB you’re connecting to.
Plus, it provides enums, convenient annotations like
@default. It’s reminiscent of both the DB migration DSLs from ActiveRecord and Ecto and also IDLs like those used by Protocol Buffers and Thrift.
I’d really love to see the Prisma schema SDL or something like it adopted as a language-neutral standard (consumable by language-specific libraries). The status quo is that every programming language reinvents the wheel. I think it’d be beneficial to have a language-agnostic way of defining those types that define the relationship between the application and the DB.
On a related note, a quick shoutout to Prisma’s exceedingly beautiful documentation. I absolutely consider this a feature and if your docs don’t belong in the “what I like most” section of a blog post then you should invest more resources into tech writing.
Questions for the project
- Would Prisma also be useful in languages that already have widely used ORMs? If I’m using ActiveRecord for Ruby or Ecto for Elixir, what’s my incentive to switch?
- This is awkward, but why is Prisma a company? I don’t know how to pay you money and I don’t know what that money would be for 🤷🏼♂️ I like free stuff so I’ll take it but I’d love to hear where this is all going.