Comparison of JSON Like Serializations – JSON vs UBJSON vs MessagePack vs CBOR

Recently I’ve been working on some extensions to ASEXOR, adding direct support for messaging via WebSocket. I use JSON for the small messages that travel between the client (browser or standalone) and the backend. The messages look like these:

messages = [
    {'call_id': 1, 'kwargs': {}, 'args': ['sleep', 0.1]},
    {'call_id': 1, 't': 'r', 'returned': 'd53b2823d35b471282ab5c8b6c2e4685'},
    {'call_id': 2, 'kwargs': {'utc': True}, 'args': ['date', '%d-%m-%Y %H:%M %Z']},
    {'call_id': 2, 't': 'r', 'returned': '77da239342e240a0a3078d50019a20a0'},
    {'call_id': 1, 'data': {'status': 'started', 'task_id': 'd53b2823d35b471282ab5c8b6c2e4685'}, 't': 'm'},
    {'call_id': 2, 'data': {'status': 'started', 'task_id': '77da239342e240a0a3078d50019a20a0'}, 't': 'm'},
    {'call_id': 1, 'data': {'status': 'success', 'task_id': 'd53b2823d35b471282ab5c8b6c2e4685', 'result': None, 'duration': 0.12562298774719238}, 't': 'm'},
    {'call_id': 2, 'data': {'status': 'success', 'task_id': '77da239342e240a0a3078d50019a20a0', 'result': '27-02-2017 11:46 UTC', 'duration': 0.04673957824707031}, 't': 'm'}
]

I wondered whether choosing a different serialization format (similar to JSON, but binary) could make the application more efficient, considering both message size and encoding/decoding time. I ran small tests in Python 3.5 (CPython and PyPy) (see tests here on gist) with a few established serializers that can be used as a quick replacement for JSON. The results are below (updated Dec 2nd 2017 thanks to a comment below, as the situation changed a bit with new library versions):

Format                       Total messages size (bytes)   Processing time (10000 x encoding/decoding all messages)
                                                           CPython        PyPy 3
JSON (standard library)      798                           789 ms         706 ms
JSON (ujson)                 798                           181 ms         3.14 s
MessagePack (official lib)   591                           286 ms         314 ms
MessagePack (umsgpack)       585                           435 ms         519 ms
CBOR                         585                           164 ms         313 ms
UBJSON                       668                           292 ms         406 ms
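
A measurement of this kind can be sketched roughly as follows. This is not the gist used for the numbers above, only a minimal illustration: it uses three of the sample messages, the helper names (total_size, round_trip_time, report) are made up here, and the msgpack and cbor2 packages are assumptions that are used only when they happen to be installed. Timings will differ by machine and library version.

```python
import json
import timeit

# Three of the sample messages from this post (a subset, to keep the sketch short).
MESSAGES = [
    {'call_id': 1, 'kwargs': {}, 'args': ['sleep', 0.1]},
    {'call_id': 1, 't': 'r', 'returned': 'd53b2823d35b471282ab5c8b6c2e4685'},
    {'call_id': 1, 'data': {'status': 'started',
                            'task_id': 'd53b2823d35b471282ab5c8b6c2e4685'}, 't': 'm'},
]

# Standard library JSON is always available; the other serializers are optional.
SERIALIZERS = {'json': (json.dumps, json.loads)}
try:
    import msgpack  # assumption: the "msgpack" package is installed
    SERIALIZERS['msgpack'] = (msgpack.packb, msgpack.unpackb)
except ImportError:
    pass
try:
    import cbor2  # assumption: the "cbor2" package is installed
    SERIALIZERS['cbor2'] = (cbor2.dumps, cbor2.loads)
except ImportError:
    pass


def to_bytes(encoded):
    """Normalise text output (JSON) to bytes so that sizes are comparable."""
    return encoded.encode('utf-8') if isinstance(encoded, str) else encoded


def total_size(dumps, messages=MESSAGES):
    """Sum of the encoded sizes of all messages, in bytes."""
    return sum(len(to_bytes(dumps(m))) for m in messages)


def round_trip_time(dumps, loads, messages=MESSAGES, repeat=10000):
    """Seconds needed to encode and decode every message `repeat` times."""
    def run():
        for m in messages:
            loads(dumps(m))
    return timeit.timeit(run, number=repeat)


def report(repeat=10000):
    """Print size and round-trip time for every available serializer."""
    for name, (dumps, loads) in sorted(SERIALIZERS.items()):
        size = total_size(dumps)
        seconds = round_trip_time(dumps, loads, repeat=repeat)
        print('{:<8} {:>4} bytes  {:.0f} ms'.format(name, size, seconds * 1000))


if __name__ == '__main__':
    report()
```

Passing the encoder and decoder as plain callables keeps the timing loop identical for every format, so the differences come from the libraries rather than from the harness.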

As messaging can involve clients in a web browser, we can also look at the performance of some serializers in JavaScript on this page. As JSON serialization is part of the browsers' Web API, unsurprisingly it’s the fastest there.

On CPython, all alternative libraries are faster than the standard library JSON, and some improved significantly from previous tests (UBJSON and umsgpack). The standard library JSON serializer can easily be replaced by the better performing ujson package. In the PyPy interpreter the standard library JSON does slightly better, however every other library performs worse, notably ujson.

Conclusions

JSON is really ubiquitous today, thanks to its ease of use and readability. It’s probably a good choice for many usage scenarios, and luckily JSON serializers show good performance. If message size is of some concern, CBOR looks like a great, almost drop-in replacement for JSON, with similar performance in Python (slower performance in the browser is not a big issue, as a browser will typically process only a few messages) and 27% smaller messages (585 vs 798 bytes).
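
The size saving can be recomputed directly from the results table. The snippet below is only a small arithmetic check; the sizes are copied from the table in this post and the function name saving_vs_json is made up for illustration.

```python
# Total encoded sizes in bytes, copied from the results table in this post.
SIZES = {
    'JSON': 798,
    'MessagePack (official lib)': 591,
    'MessagePack (umsgpack)': 585,
    'CBOR': 585,
    'UBJSON': 668,
}


def saving_vs_json(size, json_size=SIZES['JSON']):
    """Relative size reduction compared to JSON, as a percentage."""
    return 100.0 * (json_size - size) / json_size


for name, size in SIZES.items():
    print('{:<28} {:>4} bytes  {:5.1f}% smaller'.format(
        name, size, saving_vs_json(size)))
```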

If message size is a big concern, a carefully designed binary protocol (with Protocol Buffers, for instance) can provide much smaller messages (but with additional development costs).
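
To give a feeling for how much a purpose-built layout can save, here is a rough sketch using only the struct module from the standard library. It is not Protocol Buffers and not anything ASEXOR uses: the fixed layout, the numeric type codes and the function names are all hypothetical, chosen only to show the idea for one of the sample messages.

```python
import json
import struct

# Hypothetical fixed layout for the "returned" message (not a real protocol):
#   call_id   unsigned 32-bit integer, network byte order
#   type      1 byte message type code
#   returned  16 raw bytes (the 32-character hex id, decoded)
RETURNED_LAYOUT = struct.Struct('!IB16s')

# Assumed numbering of the one-letter message types used in this post.
TYPE_CODES = {'r': 1, 'm': 2}
TYPE_NAMES = {code: name for name, code in TYPE_CODES.items()}


def pack_returned(message):
    """Encode a {'call_id', 't', 'returned'} message into 21 bytes."""
    return RETURNED_LAYOUT.pack(
        message['call_id'],
        TYPE_CODES[message['t']],
        bytes.fromhex(message['returned']),
    )


def unpack_returned(payload):
    """Decode the 21-byte payload back into the original dictionary."""
    call_id, code, raw_id = RETURNED_LAYOUT.unpack(payload)
    return {'call_id': call_id, 't': TYPE_NAMES[code], 'returned': raw_id.hex()}


message = {'call_id': 1, 't': 'r', 'returned': 'd53b2823d35b471282ab5c8b6c2e4685'}
payload = pack_returned(message)

print(len(json.dumps(message).encode('utf-8')), 'bytes as JSON')
print(len(payload), 'bytes in the fixed layout')
```

The price of the smaller payload is visible in the code itself: field names, types and order live in the layout definition, so both sides have to agree on it and every change to the message shape needs a coordinated update.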

5 thoughts on “Comparison of JSON Like Serializations – JSON vs UBJSON vs MessagePack vs CBOR”

  1. Thanks for the great blog post.

    Another data format worth checking out is Smile.
    For a similar example I get these numbers.

    JSON: 744 bytes
    Smile: 470 bytes
    CBOR: 600 bytes
    Msgpack: 586 bytes

    The reason why Smile is much smaller is the built-in back reference feature. Formats like JSON, CBOR and MessagePack have the problem that they have to send the key name with every field. In your example, the JSON, CBOR and MessagePack outputs all contain the string ‘call_id’ 8 times. Smile writes this string only once and then adds a reference in all the other locations. When you send a lot of similar objects this can save a lot of bandwidth.

    Text from Wikipedia: https://en.wikipedia.org/wiki/Smile_(data_interchange_format)
    Compared to JSON, Smile is both more compact and more efficient to process (both to read and write). Part of this is due to more efficient binary encoding (similar to BSON, CBOR and UBJSON), but an additional feature is optional use of back references for property names and values. Back referencing allows replacing of property names and/or short (64 bytes or less) String values with 1- or 2-byte reference ids.

    1. Frankly speaking, at least in Python it does not seem to be a good alternative. I tried pysmile: it’s 2.7 only, the last commit was 2 years ago, performance is very bad (17 secs for the same tests), and message size is 618 bytes.

  2. Your test data seems to contain some binary data in hexadecimal form (e.g. “returned” and “task_id”). If you stored these in byte form, you’d benefit from them requiring only half the space compared to their hex string representation, and any format that supports a binary type would produce a considerably smaller encoded size.

    Also, it would appear you didn’t run your tests with the ubjson C-extension compiled. With it, performance should be comparable with (if not slightly better than) Python’s built-in json module (assuming version 0.10.0 or later).

    1. Thanks, updated with recent results. task_id and returned are basically UUIDs, so a string representation is easiest, but I agree that in a wire protocol they can be converted to bytes if one wants to save space. Anyhow, the messages are rather arbitrary, I just had them at hand when doing the tests. Any other messages can easily be tested in the provided IPython notebook.
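
Both size observations from the comments above can be checked with the standard library alone. The sketch below is only an illustration on two of the sample messages; the helper names key_counts and hex_vs_raw are made up here.

```python
from collections import Counter

# Two of the sample messages from this post.
SAMPLE = [
    {'call_id': 1, 't': 'r', 'returned': 'd53b2823d35b471282ab5c8b6c2e4685'},
    {'call_id': 2, 't': 'r', 'returned': '77da239342e240a0a3078d50019a20a0'},
]


def key_counts(messages):
    """Count how often each top-level key name repeats across the messages."""
    return Counter(key for message in messages for key in message)


def hex_vs_raw(hex_id):
    """Return (length as hex text, length as raw bytes) for an id."""
    return len(hex_id), len(bytes.fromhex(hex_id))


# Every repeated key name is sent again by JSON, CBOR and MessagePack.
print(key_counts(SAMPLE))

# A 32-character hex id needs only 16 bytes in raw form.
print(hex_vs_raw(SAMPLE[0]['returned']))
```

Note that JSON has no native byte string type, so the raw form only pays off in formats that have one, such as CBOR or MessagePack.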
