Key Takeaways:
Every database deals with text data, and strings often dominate data workloads. In analytical queries, string processing can become a major bottleneck. Traditional string representations (like typical null-terminated C strings or heap-allocated objects) carry significant overhead: extra metadata, pointer indirections, and poor cache locality. Over millions of rows, those inefficiencies add up. To unlock faster insights, modern analytics engines require a more cache-friendly, compact way to handle text. This is why we’re excited about German Strings, an approach that restructures how strings are stored in memory to dramatically boost string performance. At e6data, we’ve implemented this optimization in our real-time analytics engine as a new feature to supercharge text-heavy queries.
The quirky name comes from academic lineage. The layout was described in a 2020 paper from TUM (Technical University of Munich) and implemented in their Umbra database system. Andy Pavlo jokingly dubbed them “German Strings” in homage to their German origin, and the nickname stuck. Many have wondered whether it relates to handling German characters (umlauts, ß, etc.), but it doesn’t – it’s purely about memory layout, not locale or character encoding. (For the record, German Strings fully support Unicode (UTF-8) text like any other string type.) The success of the idea has made it widespread, appearing in systems like DuckDB, Apache Arrow (and its DataFusion query engine), Polars, and Meta’s Velox engine. This broad adoption underlines that German Strings aren’t a niche experiment, but a proven technique now everywhere in analytics databases.
At the heart of German Strings is a fixed 16-byte string struct. In those 16 bytes, the string’s essential info is stored in one of two ways, depending on length: a short-string or a long-string format. For short strings (≤ 12 bytes), the struct stores the string inline. The first 4 bytes record the string’s length, and the remaining 12 bytes directly contain the string’s characters. There’s no separate heap allocation at all – the text lives right inside the struct. For example, a string like e6data (6 bytes) fits entirely within the 16-byte structure, including its length field.
For longer strings (> 12 bytes), the struct acts as a smart reference. It still begins with a 4-byte length, but instead of storing the whole content, it stores a prefix plus a pointer substitute. A copy of the string’s first 4 bytes is embedded as a prefix, and the remaining space holds two 32-bit values that locate the full string data elsewhere. In our implementation (which aligns with Apache Arrow’s StringView), those two 32-bit fields serve as a buffer index and an offset: instead of a raw pointer, they say which memory buffer holds the rest of the string and where it starts. Splitting an 8-byte pointer into two 4-byte indices lets us reference string data spread across multiple buffers, which is useful for memory management and zero-copy data loading in columnar formats. Crucially, the 4-byte prefix stays in the struct itself, so many comparisons can be decided without ever following the reference.
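To make the layout concrete, here is a minimal Rust sketch of such a 16-byte view, using Arrow StringView-style field names (len, prefix, buffer_index, offset). The names and the helper constructor are illustrative only, not our production code.

```rust
use std::mem::size_of;

/// A 16-byte string view: a 4-byte length plus either 12 inline bytes
/// (short strings) or a 4-byte prefix, a buffer index, and an offset
/// (long strings). Field names follow the Arrow StringView convention.
#[derive(Clone, Copy)]
#[repr(C)]
struct StringView {
    len: u32,
    payload: Payload,
}

#[derive(Clone, Copy)]
#[repr(C)]
union Payload {
    /// Short format (len <= 12): the bytes live right here in the struct.
    inline: [u8; 12],
    /// Long format (len > 12): prefix plus the location of the full data.
    indirect: Indirect,
}

#[derive(Clone, Copy)]
#[repr(C)]
struct Indirect {
    prefix: [u8; 4],   // first 4 bytes of the string, kept for fast comparisons
    buffer_index: u32, // which external buffer holds the full string
    offset: u32,       // where the string starts inside that buffer
}

// The whole view is exactly 16 bytes, as described above.
const _: () = assert!(size_of::<StringView>() == 16);

impl StringView {
    const INLINE_MAX: usize = 12;

    /// Build a view over `data`; `buffer_index`/`offset` are only used
    /// when the string is too long to be stored inline.
    fn new(data: &[u8], buffer_index: u32, offset: u32) -> Self {
        let len = data.len() as u32;
        if data.len() <= Self::INLINE_MAX {
            let mut inline = [0u8; 12];
            inline[..data.len()].copy_from_slice(data);
            Self { len, payload: Payload { inline } }
        } else {
            let mut prefix = [0u8; 4];
            prefix.copy_from_slice(&data[..4]);
            Self {
                len,
                payload: Payload {
                    indirect: Indirect { prefix, buffer_index, offset },
                },
            }
        }
    }

    /// First (up to) 4 bytes of the string, readable without touching any
    /// external buffer. Both layouts keep these bytes at the same position,
    /// so the union read is valid in either case.
    fn prefix(&self) -> [u8; 4] {
        let mut p = [0u8; 4];
        let n = (self.len as usize).min(4);
        p[..n].copy_from_slice(unsafe { &self.payload.inline[..n] });
        p
    }
}
```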
This also pairs really well with dictionary encoding: the fixed-size views can point into a shared dictionary buffer, and the prefix enables quick dictionary lookups that reject most candidates without dereferencing that buffer.
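As a rough illustration of that lookup pattern, continuing the sketch above (and assuming it sits in the same module, with a single dictionary buffer so buffer_index is not consulted), a probe can be rejected on length and prefix alone before the dictionary buffer is ever touched:

```rust
/// Cheap pre-check using only the 16-byte views: different lengths or
/// different prefixes mean the strings cannot be equal.
fn could_match(a: &StringView, b: &StringView) -> bool {
    a.len == b.len && a.prefix() == b.prefix()
}

/// Returns the dictionary code for `probe_bytes`, if present.
/// `probe` is the 16-byte view built over `probe_bytes`; `dict_views` are
/// views into the shared `dict_buffer`.
fn find_in_dictionary(
    probe: &StringView,
    probe_bytes: &[u8],
    dict_views: &[StringView],
    dict_buffer: &[u8],
) -> Option<usize> {
    dict_views.iter().position(|entry| {
        if !could_match(probe, entry) {
            return false; // rejected without dereferencing the dictionary buffer
        }
        // Length and prefix match: compare the full bytes.
        let entry_bytes: &[u8] = if (entry.len as usize) <= StringView::INLINE_MAX {
            unsafe { &entry.payload.inline[..entry.len as usize] }
        } else {
            let start = unsafe { entry.payload.indirect.offset } as usize;
            &dict_buffer[start..start + entry.len as usize]
        };
        entry_bytes == probe_bytes
    })
}
```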
The fixed layout also opens the door to other interesting (and potentially faster) ways of doing existing work. For example, the 4-byte prefix can be treated as an integer, so strings can be radix-sorted on that key, falling back to full comparisons only when necessary.
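Here is a small, self-contained sketch of that idea: the prefix is packed into a big-endian u32 key, the sort is driven by that key (an LSD radix sort over the fixed-width keys would slot into the same place), and full strings are compared only when two keys tie. Plain byte slices stand in for the views to keep the example short.

```rust
/// Pack the first (up to) 4 bytes of a string into a big-endian u32.
/// Big-endian means comparing the keys as integers is the same as comparing
/// the zero-padded 4-byte prefixes lexicographically.
fn prefix_key(s: &[u8]) -> u32 {
    let mut p = [0u8; 4];
    let n = s.len().min(4);
    p[..n].copy_from_slice(&s[..n]);
    u32::from_be_bytes(p)
}

fn sort_by_prefix(strings: &mut [&[u8]]) {
    // The key comparison below could be replaced by a radix sort over the
    // fixed-width u32 keys; the tie-breaking fallback stays the same.
    strings.sort_unstable_by(|a, b| {
        let (ka, kb) = (prefix_key(a), prefix_key(b));
        if ka != kb {
            ka.cmp(&kb) // decided by the 4-byte prefix alone
        } else {
            a.cmp(b) // prefixes tie: fall back to comparing the full strings
        }
    });
}

fn main() {
    let mut data: Vec<&[u8]> = vec![
        b"e6data".as_slice(),
        b"analytics".as_slice(),
        b"analysis".as_slice(),
        b"aa".as_slice(),
    ];
    sort_by_prefix(&mut data);
    let sorted: Vec<&str> = data.iter().map(|s| std::str::from_utf8(s).unwrap()).collect();
    assert_eq!(sorted, ["aa", "analysis", "analytics", "e6data"]);
}
```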
German Strings are a small change that pays off in a big way. By switching to a 16-byte, prefix-inclusive format, we trim memory overhead and speed up common string operations. If you work in data engineering, you know strings are everywhere, so better string performance shows up across filters, joins, and sorts. At e6data, adopting German Strings lets our real-time analytics engine move faster without asking you to change a thing.
This approach is no longer just a research idea. You can see versions of it in DuckDB and in Apache Arrow’s StringView, and the results hold up under pressure. In our own testing, plans that used to grind through full strings now resolve early using the inline string prefix, which improves cache locality and cuts needless work.
The big picture is simple. German Strings help our engine treat text more like numbers. You get quicker queries, lower latency, and leaner memory use, especially when workloads hammer on equality checks and prefix comparisons. We’re rolling this out across the e6data platform and pairing it with other practical wins, like better hash tables for string keys. Sometimes performance comes from perfecting the basics. Here, it comes from 16 well-used bytes.
Further Reading:
Why German Strings are Everywhere
A Deep Dive into German Strings
Using StringView / German Style Strings to Make Queries Faster: Part 1 - Reading Parquet
Using StringView / German Style Strings to Make Queries Faster: Part 2 - String Operations