We need to store rich text that is used on the web, in apps and in UIs, and that stays editable in an editor.
A long time ago, it started as pure HTML in the DB. When we wanted to separate editing from printing, we opted for JSON.
But now we want to manipulate the data inside efficiently. For that, we decided to create a binary format.
It proved to be faster and easier to change, at the cost of size.
Background Of Our Rich Text Editing
Editor and data #1 - HTML
A long, long time ago, like most other developers, we were using TinyMCE, then the other one (there were two “good” editors 15 years ago, right?). HTML was plug-and-play: HTML goes in, edited HTML comes out, gets stored in the DB and is printed as is. That worked for the first few years of my career.
Editor and data #2 - JSON
Then we needed to both limit and expand. That sounds contradictory, but only because we were changing two things in one update.
The first was that those editors were trying to mimic Word, giving users every possible option. Do you know what happens if you give a user colors? They will use them. On a carefully designed and curated website or app, that looks highly untrustworthy. And it simply looks bad. We could limit the options, but then what’s the point of such a huge library?
The second idea was to extend the functionality into galleries, FAQs, even custom content blocks. But every project is different while the editor stays the same. We also wanted a stored rich text to survive graphical changes.
So a JSON representation of the data was born, inside a completely custom editor.
Boy, the first iteration sucked! It was hard to make an editor back then. And it’s not easy today, because the browsers’ Selection API is just horrible. But we can talk about editors another time. This is about the data format.
Editor and data #3 - BINARY
Here we are now. JSON wasn’t enough anymore. There was nowhere left to optimize building and parsing speeds. When we needed to go through all the texts to search and change something (typically old links), we had to parse the whole thing, rebuild the whole array and stringify it again. That is a pain to work with, slow and highly inefficient for every purpose. Why wait for something for 5 seconds if you can wait 5 minutes.
Data manipulation is what tipped us over.
And because we expect large contents in the near future that must be streamable, custom and binary it is.
Binary In PHP
PHP and other scripting languages are here for high-level thinking. Usually you don’t want to think about memory or bytes.
Neither PHP nor JavaScript is optimized for that, but both have functions for managing bytes. More than once I said “I miss Zig” while working on the encoding.
PHP has:
pack()
unpack()
pack() makes pure bytes from your number or string. unpack() takes those bytes and gives you back your number or string.
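A minimal sketch of that round trip. The record layout here (a 1-byte type, a 4-byte length, then the text) is made up for illustration and is not the format described in this post:

```php
<?php
// Hypothetical record: 1-byte element type, 4-byte text length, then the text.
$text = 'Hello, bytes';

// 'C' = unsigned 8-bit integer, 'V' = unsigned 32-bit little-endian integer.
// strlen() counts bytes, which is exactly what a length prefix needs.
$bytes = pack('CV', 1, strlen($text)) . $text;

// unpack() returns an associative array keyed by the names in the format string.
// Bytes after the described header are simply ignored.
$header = unpack('Ctype/Vlength', $bytes);

// The header is 1 + 4 = 5 bytes long, the text starts right after it.
$body = substr($bytes, 5, $header['length']);

var_dump($header['type'], $header['length'], $body === $text);
```

The format codes decide both the width and the byte order, so the same codes have to be used on both sides.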
The Unexpected And What I’ve Learned
Before starting, I knew PHP could handle it. I just never read up on “how”.
I come from the Czech Republic, and Czech is full of characters that take more than one byte in UTF-8. So I am used to using and seeing a lot of “mb_” prefixes.
But here you don’t want “mb_” or any multibyte string operations. You literally want to read bytes one by one, or in chunks of a specific size. For example, the chunk for an unsigned 64-bit integer is 8 bytes. This is very common in the code:
substr($chunk, $last_position, 8);
What I am saying is - you want to be precise about reading bytes.
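As a sketch of what precise reading can look like: the helper name read_u64() and the little-endian layout are assumptions for this example, and the 'P' format code needs a 64-bit PHP build.

```php
<?php
// Two unsigned 64-bit little-endian integers stored back to back ('P' format).
$chunk = pack('PP', 42, 1000);

// Hypothetical helper: read one 64-bit integer and advance the position.
function read_u64(string $chunk, int &$position): int
{
    // substr() counts bytes, so this always takes exactly 8 of them.
    $raw = substr($chunk, $position, 8);
    $position += 8;

    // An unnamed unpack() result is stored under key 1.
    // Note: values above PHP_INT_MAX come back negative, PHP has no unsigned int.
    return unpack('P', $raw)[1];
}

$last_position = 0;
$first  = read_u64($chunk, $last_position); // 42
$second = read_u64($chunk, $last_position); // 1000

// Why byte functions matter: "č" is one character but two bytes in UTF-8.
$czech_bytes = strlen("\u{10D}"); // 2
```

A multibyte function such as mb_substr() counts characters instead of bytes, so the same offsets would point somewhere else as soon as the data contains non-ASCII bytes.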
Talking Between Client And Server
I am not against JSON. I like JSON. But it fucked me with this project.
I didn’t expect that PHP’s
json_encode();
json_decode();
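don’t like non-UTF-8 bytes. They fail with a JSON_ERROR_UTF8 error: json_encode() returns false and json_decode() returns null, or they throw if you pass JSON_THROW_ON_ERROR.
We use JSON to send data from the server down to the client and back. The data simply stopped arriving.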
The solution was to encode those fields in Base64. I am not happy about it. I am also not unhappy enough to change the whole communication we have now. We are planning on updating it anyway in the future.
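A small sketch of both the failure and the Base64 workaround. The field name content and the sample bytes are made up for the example:

```php
<?php
// Arbitrary binary data; the trailing 0xFF 0xFE bytes are not valid UTF-8.
$blob = pack('CV', 1, 300) . "\xFF\xFE";

// Without JSON_THROW_ON_ERROR, json_encode() returns false on invalid UTF-8.
$direct = json_encode(['content' => $blob]);
$error  = json_last_error(); // JSON_ERROR_UTF8

// Base64 turns the bytes into plain ASCII, which JSON accepts.
$payload = json_encode(['content' => base64_encode($blob)]);

// Receiving side: decode the JSON, then decode the field back into bytes.
$decoded  = json_decode($payload, true);
$restored = base64_decode($decoded['content'], true); // strict mode

var_dump($direct, $error === JSON_ERROR_UTF8, $restored === $blob);
```

The price is size: Base64 emits 4 characters for every 3 bytes, so the encoded field is about a third larger than the raw bytes.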
Flat Is Best
This is not really a binary thing. But good advice anyway.
Structured documents feel very… structured. They are usually nested and so we feel like we have to or just want to keep that representation alive.
It’s usually not a good choice. Having a flat structure alongside a very simple tree is almost always a good call.
I opted for “elements”: a flat list of texts, tags, lists etc. How to combine the elements into deep HTML is kept separately, as a tree of offsets.
In the vast majority of cases, you don’t need the tree. You need to find the "a"s, or see just the plain text, things like that. Full parsing happens rarely.
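To make the idea concrete, here is a sketch using plain PHP arrays. The array layout and the function names (plain_text, replace_links, render) are invented for illustration and are not the actual binary format from this post:

```php
<?php
// Hypothetical flat list of elements, in document order.
$elements = [
    0 => ['type' => 'tag',  'name' => 'p'],
    1 => ['type' => 'text', 'value' => 'Read the '],
    2 => ['type' => 'tag',  'name' => 'a', 'attrs' => ['href' => 'https://old.example/docs']],
    3 => ['type' => 'text', 'value' => 'docs'],
    4 => ['type' => 'text', 'value' => '.'],
];
// The tree is only offsets: parent offset => offsets of its children.
$tree = [0 => [1, 2, 4], 2 => [3]];

// Plain text: one pass over the flat list, the tree is never touched.
function plain_text(array $elements): string
{
    $text = '';
    foreach ($elements as $element) {
        if ($element['type'] === 'text') {
            $text .= $element['value'];
        }
    }
    return $text;
}

// Link rewrite: again a single flat pass, no recursion.
function replace_links(array $elements, string $from, string $to): array
{
    foreach ($elements as $offset => $element) {
        if ($element['type'] === 'tag' && $element['name'] === 'a'
            && isset($element['attrs']['href'])) {
            $elements[$offset]['attrs']['href'] =
                str_replace($from, $to, $element['attrs']['href']);
        }
    }
    return $elements;
}

// Full render: the only operation that walks the tree.
function render(array $elements, array $tree, int $offset): string
{
    $el = $elements[$offset];
    if ($el['type'] === 'text') {
        return htmlspecialchars($el['value']);
    }
    $attrs = '';
    foreach ($el['attrs'] ?? [] as $name => $value) {
        $attrs .= ' ' . $name . '="' . htmlspecialchars($value) . '"';
    }
    $inner = '';
    foreach ($tree[$offset] ?? [] as $child) {
        $inner .= render($elements, $tree, $child);
    }
    return "<{$el['name']}{$attrs}>{$inner}</{$el['name']}>";
}

$text    = plain_text($elements); // "Read the docs."
$updated = replace_links($elements, 'https://old.example', 'https://new.example');
$html    = render($updated, $tree, 0);
// <p>Read the <a href="https://new.example/docs">docs</a>.</p>
```

The two common operations never look at $tree. Only render() pays for the nesting.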
JSON vs. BLOB
Bytes just win.
It’s in fact not that difficult to make a custom binary format. My one recommendation to everybody doing it: document the final format. This is one of the cases where you really want a proper specification. It’s genuinely harder and more annoying to read this kind of code later, once you have forgotten how the format was conceived.
Old encoding into structure: 2.1847579479218 s
Old decoding from structure to html: 0.86596989631653 s
New encoding into structure: 1.198970079422 s
New decoding from structure to html: 0.67016983032227 s
These results show encoding and decoding with the old JSON format and the new binary format. Each was looped 10 000 times and the numbers are consistent across runs.
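For anyone who wants to run a similar comparison, a measurement loop could look roughly like this. The bench() helper and the sample data are assumptions for the sketch, not the code that produced the numbers above:

```php
<?php
// Hypothetical helper: total seconds spent on $iterations calls of $fn.
function bench(callable $fn, int $iterations = 10000): float
{
    $start = hrtime(true); // monotonic clock, in nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (hrtime(true) - $start) / 1e9;
}

// Stand-in data; a real test would use representative rich text.
$data = ['type' => 'p', 'children' => [['type' => 'text', 'value' => 'Hello']]];
$json = json_encode($data);

$encode_seconds = bench(fn () => json_encode($data));
$decode_seconds = bench(fn () => json_decode($json, true));

printf("encode: %.4f s, decode: %.4f s\n", $encode_seconds, $decode_seconds);
```

hrtime() is used instead of microtime() because it is monotonic, so a clock adjustment in the middle of a run cannot skew the result.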
We have already created fairly simple functions for manipulation. Changing an href? Easy. Grabbing text only? Easiest. Cleaning up trashy HTML is a 30-line function now.
We expect this new format to be used for a long time. Surely versioned and enhanced over time. And later used in our bare-metal software. Putting this into Zig? Damn, that thing won’t even take a microsecond.
I already know this was a good decision.
It took a week of work to do it properly. Definitely worth it!
One downside - it’s larger: 3x the size of the final HTML and 2x the size of the JSON. We render on the server, so that is a non-issue for us.