Now, this post is a post of celebration :sparkles:. It has a bit of a marketing feel to it, so if you would rather just read some code, head over to our GitHub repo! Otherwise, read on :book:

Why Should I be Interested in flatdata?

There are plenty of libraries and services for handling your data: numerous (R)DBMSs, key-value stores, serialization libraries and protocols, and many of them target high performance. How is flatdata different?

Access Patterns

What happens if you need to traverse hundreds of thousands of entities per incoming request? What if for each entity you need to do extra lookups in the data? What if the entities are relatively small (a few dozen bytes), but exhibit poor locality? Conventional databases can work, but they may stop scaling very quickly.

flatdata provides tools to implement efficient data storage that enables such use cases. While structuring the data is still up to the developer, flatdata provides implementations of common patterns that make the job simpler.

Efficient Storage

What if your data consists of millions of entities? How about billions? What if the data is so rich that each entity can have hundreds of potentially assigned attributes? Storing things efficiently makes for a smaller footprint, saves bandwidth and makes better use of system caches, which directly affects performance.

flatdata lets you operate at the bit level without losing the benefits of structured data access. If an attribute is boolean, why use a whole byte for it? If the value range is [0, 2^11), why not use the five spare bits at the end of the field?
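To make the bit-level idea concrete, here is a minimal sketch in plain Python. It assumes a value that fits into 11 bits, stored in a 16-bit field, with the five spare bits reused for boolean flags. The layout and the names `pack` and `unpack` are made up for illustration and are not flatdata's format or API.

```python
# Hypothetical layout (not flatdata's format): one 16-bit field holding
# an 11-bit value in the low bits and five boolean flags in the high bits.
VALUE_BITS = 11
VALUE_MASK = (1 << VALUE_BITS) - 1  # 0x7FF
FLAG_COUNT = 16 - VALUE_BITS        # the five spare bits


def pack(value, flags):
    """Pack an 11-bit value and up to five booleans into one 16-bit integer."""
    if not 0 <= value <= VALUE_MASK:
        raise ValueError("value does not fit into 11 bits")
    if len(flags) > FLAG_COUNT:
        raise ValueError("only five spare bits are available")
    packed = value
    for i, flag in enumerate(flags):
        if flag:
            packed |= 1 << (VALUE_BITS + i)
    return packed


def unpack(packed):
    """Return the (value, flags) pair stored by pack()."""
    value = packed & VALUE_MASK
    flags = [bool((packed >> (VALUE_BITS + i)) & 1) for i in range(FLAG_COUNT)]
    return value, flags


packed = pack(2047, [True, False, True, False, False])
raw = packed.to_bytes(2, "little")  # two bytes on disk for all six attributes
value, flags = unpack(int.from_bytes(raw, "little"))
```

Stored naively, the same six attributes would take a 16-bit integer plus one byte per boolean, seven bytes instead of two.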

Structure

Finally, if your data is rich, you might want to give it some structure. flatdata lets you define the data structure in a schema language, which also serves as documentation of the data format and saves you from writing boilerplate code.

What is flatdata?

Quoting the concise definition from the readme file:

Flatdata is a library providing data structures for convenient creation, storage and access of packed memory-mappable structures with minimal overhead possible. Structures may contain bitfields which will be serialized in a platform-independent manner.

flatdata includes:

  • Schema DSL for defining data layout and versioning
  • Code generator to target development languages
  • Set of libraries for each target language
  • Set of tools

flatdata Philosophy

flatdata is designed as an efficient write-once, read-many data store. It does not look or feel like a conventional database. Instead, think of it as a framework for building efficient data sources:

  • To store data, flatdata provides efficient serializers.
  • flatdata is immutable: once archives are created, they cannot be altered.
  • flatdata archives are backward-compatible only as long as their schema stays the same.

When is That Useful?

We designed the library that way to support the following patterns (which happened to be just what we needed :wink:):

  • Your data updates infrequently (say, every few hours; as long as the interval between updates is much longer than the serialization time, you should be set).
  • You can afford to recreate the full flatdata archive every time its structure needs an update.
  • Your data can be efficiently serialized into a mid-sized archive (~50 GB? ~200 GB? As long as the data size is comparable to the amount of RAM on the machine, you should be fine).
  • Your data is going to be accessed many times (substantially more often than it is updated).
  • You want to optimize your data to be cache-friendly: sort things as often as possible, minimize size of indices, group data that is used together, store everything else separately.

What Can I Store in Flatdata?

flatdata does not impose any particular way of structuring the data. So, mostly anything. That said, the following patterns work best:

  • Structures with categorical and numerical data (look for basic types and structures).
  • Implicit or explicit references between data structures: one to one, one to many, many to many. Pretty much as in any DBMS, it is up to the developer to implement them via associative vectors.
  • Low-frequency attributes, which can be assigned to only a small subset of a large number of entities (see multivector).
  • Arbitrary metadata that can be attached to entities (see raw data).
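As an illustration of why low-frequency attributes deserve their own storage, the sketch below keeps a rarely assigned attribute in two sorted side arrays instead of reserving a slot in every entity. This is a generic technique shown in plain Python, not flatdata's multivector format; the names `sparse_ids`, `sparse_values` and `lookup`, and the sample data, are hypothetical.

```python
from bisect import bisect_left

# Hypothetical data: only three of a million entities carry the attribute.
ENTITY_COUNT = 1_000_000
sparse_ids = [17, 4242, 999_999]  # sorted ids of entities that have a value
sparse_values = [30, 50, 120]     # value for the id at the same position


def lookup(entity_id):
    """Return the attribute for entity_id, or None if it was never assigned."""
    pos = bisect_left(sparse_ids, entity_id)
    if pos < len(sparse_ids) and sparse_ids[pos] == entity_id:
        return sparse_values[pos]
    return None
```

The side arrays grow with the number of assigned values rather than with the number of entities, and a lookup costs one binary search.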

What is behind flatdata?

  • flatdata is based on files. flatdata uses custom platform-independent structure alignment (one byte) and stores data in little-endian order.
  • Reading and writing data is implemented in efficient C++. Templates, inlining and compiler optimizations yield only a few instructions per data access.
  • When accessing flatdata, no data is read or copied until the point you use it. When you use it, you read only the bytes you need. Memory-mapped files and the page cache take care of everything else.
  • When writing to flatdata, one can either create a full collection of structures in memory and dump it to disk, or efficiently build up a large collection while keeping a fixed memory footprint.
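The access model described above can be sketched with Python's standard `mmap` and `struct` modules. The record layout (two little-endian u32 values, no padding) and the file name are assumptions made for this example; the code shows the general memory-mapping technique, not flatdata's own reader or file format.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record: two little-endian u32 values with no padding (8 bytes).
POINT = struct.Struct("<II")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "points.bin")

    # Write 1000 records one at a time, so memory use stays constant.
    with open(path, "wb") as f:
        for i in range(1000):
            f.write(POINT.pack(i, i * 2))

    # Map the file read-only; no record is decoded until it is requested.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            count = len(view) // POINT.size
            # Decode only the eight bytes that make up record 500.
            x, y = POINT.unpack_from(view, 500 * POINT.size)
```

Because every record has the same size, the position of record `i` is simply `i * POINT.size`, and the operating system's page cache decides which parts of the file actually reside in memory.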

Schemas, Archives and Resources

The entry point to flatdata is an archive. An archive is defined by its schema. The schema defines the data structures and is stored along with the created archives. The archive's schema must be identical to the one used in the software, otherwise opening the archive will fail. A simple schema can look like this:

namespace loc {
    struct Point {
        x : u32 : 32;
        y : u32 : 32;
    }
    archive Locations {
        pois : vector< Point >;
    }
}

Schema defines:

  • Structures - smallest units of information that can be stored in an archive.
  • Resources - collections of structures grouped together.
  • Archives - collections of resources that can be used as a whole.

Given a schema, one can generate C++ or Python code from it and use that code in an application to build and read flatdata archives. For more details, have a look at the readme file, the examples and the flatdata reference.
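The rule that an archive only opens when its stored schema matches the one the software was built with can be pictured with the following plain-Python sketch. The file names `pois` and `pois.schema` and the functions `create_archive` and `open_archive` are hypothetical and do not reflect flatdata's generated code or on-disk layout.

```python
import os
import tempfile

# The schema the application was built against (same text as the example schema).
SCHEMA = """namespace loc {
    struct Point {
        x : u32 : 32;
        y : u32 : 32;
    }
    archive Locations {
        pois : vector< Point >;
    }
}
"""


def create_archive(directory, schema, payload):
    """Write the payload and the schema it was built with side by side."""
    with open(os.path.join(directory, "pois"), "wb") as f:
        f.write(payload)
    with open(os.path.join(directory, "pois.schema"), "w") as f:
        f.write(schema)


def open_archive(directory, expected_schema):
    """Return the payload, or raise if the stored schema differs."""
    with open(os.path.join(directory, "pois.schema")) as f:
        stored = f.read()
    if stored != expected_schema:
        raise ValueError("schema mismatch: archive was built with a different schema")
    with open(os.path.join(directory, "pois"), "rb") as f:
        return f.read()
```

Failing early at open time is cheaper than discovering much later that every field has been decoded at the wrong offset.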

C++ and Python

flatdata currently supports two languages, but it can (and probably will) be extended further:

  • C++ - provides efficient serializers and a library. It is the first-class citizen, as it is the main implementation our services use.
  • Python - provides a library for conveniently accessing arbitrary flatdata archives. It is efficient enough for processing smaller archives and for inspection and debugging.

Next Steps

We are very excited that flatdata is open source now. We use it heavily in production and plan to extend it further, including new language support, better documentation, new functionality and support for new use cases. We have already invested lots of our free time into it, and I hope we have made something that the open source community can benefit from. Compared to mature open source projects, we’ve got a way to go, but one step at a time. And now, the first step is done.

We would love to hear from you. If you have any feedback, let us know. If you are going to use flatdata and want to share that, let us know as well; we will be really happy. Most importantly, should you decide to contribute, don’t hesitate to create a pull request! We are friendly and will always try to respond in a timely manner :bowtie:.

Credits

Icons used in this post are designed by Smashicons from Flaticon
