Data Optimization Is the Key to Unlocking Blockchain Scalability

26 Oct 2024

Blockchain optimization is one of the most popular topics of discussion in Web3, but one important optimization strategy doesn’t get nearly as much attention as it should: data compression. Over the past decade, there have been many attempts to build blockchains that can handle increased volumes of data, but very few, if any, were also designed to make the data itself more manageable.

This is the approach that Somnia has taken: building a blockchain that is prepared to handle large volumes of data, while also ensuring that the data is handled as efficiently as possible. Data is optimized at several different points in the design.

Let’s start with our consensus algorithm, Multistream Consensus. We will describe this algorithm in detail in a future article, but at a very high level, each validator runs its own data chain to increase throughput on the blockchain. Multistream Consensus has a variety of benefits, but the one we are going to focus on here is that it enables something called streaming compression.

Streaming Compression

The most common type of compression is called “Block Compression,” and it’s frequently used in traditional software like zip files. In block compression, everything needed to recover the original data is contained in the compressed file itself. This makes it simpler to use, since the receiver needs only that one file.

Streaming compression, on the other hand, assumes that both the sender and receiver share the same history of compressed and decompressed data. This allows the sender to refer back to earlier data, which leads to better compression because repeated content doesn’t have to be sent again. This process is not feasible for most existing blockchains because it requires the same sender, holding the same compression history, to produce every message, and blockchains typically rotate who is proposing new data.
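To make the difference concrete, here is a minimal Python sketch using the standard zlib module. It is not Somnia's implementation: the contract address, the JSON-like payload format, and the function names are all made up for illustration. It compresses the same messages two ways, once independently (no shared history) and once on a single long-lived stream that is flushed after each message.

```python
import zlib

# Arbitrary example address and payload format (assumptions for illustration).
CONTRACT = "0x52908400098527886e0f7030069857d2e4169ee7"
messages = [
    f'{{"to":"{CONTRACT}","method":"transfer","nonce":{i}}}'.encode()
    for i in range(200)
]

def block_compressed_size(msgs):
    """Total size when every message is compressed on its own."""
    return sum(len(zlib.compress(m)) for m in msgs)

def stream_compress(msgs):
    """Compress all messages on one stream, emitting one chunk per message."""
    comp = zlib.compressobj()
    chunks = []
    for m in msgs:
        # Z_SYNC_FLUSH emits everything buffered so far but keeps the
        # compression history, so later messages can refer back to it.
        chunks.append(comp.compress(m) + comp.flush(zlib.Z_SYNC_FLUSH))
    return chunks

def stream_decompress(chunks):
    """The receiver must process chunks in order with one shared context."""
    decomp = zlib.decompressobj()
    return [decomp.decompress(c) for c in chunks]

chunks = stream_compress(messages)
print("independent:", block_compressed_size(messages), "bytes")
print("streamed:   ", sum(len(c) for c in chunks), "bytes")
```

The streamed total is much smaller because each new message is mostly a back-reference to earlier ones. The catch is visible in stream_decompress: a receiver that missed earlier chunks, or that gets chunks from a different sender's stream, cannot decode them.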

However, Somnia has a consensus and data availability algorithm designed to support streaming-based compression. Each validator is responsible for publishing its own stream of data to its own blockchain. These are the data chains of Multistream Consensus that we discussed earlier. Because each data chain always has the same publisher, the sender and the receivers of that chain share a consistent history, which unlocks streaming compression.
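One way to picture this is a receiver that keeps a separate decompression context for every validator's stream. The sketch below is a hypothetical illustration built on Python's zlib, not Somnia's actual protocol; the class names, the validator IDs, and the payloads are assumptions.

```python
import zlib

class StreamPublisher:
    """A validator publishing to its own data chain (hypothetical)."""

    def __init__(self):
        self._comp = zlib.compressobj()  # history persists across publishes

    def publish(self, payload: bytes) -> bytes:
        return self._comp.compress(payload) + self._comp.flush(zlib.Z_SYNC_FLUSH)

class StreamReceiver:
    """Keeps one decompression context per validator stream (hypothetical)."""

    def __init__(self):
        self._contexts = {}

    def receive(self, validator_id: str, chunk: bytes) -> bytes:
        if validator_id not in self._contexts:
            self._contexts[validator_id] = zlib.decompressobj()
        return self._contexts[validator_id].decompress(chunk)

# Two validators publish interleaved data; each stream decodes independently.
pub_a, pub_b = StreamPublisher(), StreamPublisher()
receiver = StreamReceiver()
for i in range(3):
    print(receiver.receive("A", pub_a.publish(b"a-%d" % i)))
    print(receiver.receive("B", pub_b.publish(b"b-%d" % i)))
```

Since a stream's publisher never changes, its history never has to be handed over to another machine, which is the property that rotating proposers lack.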

Power Law Distribution

Our advanced compression is possible because the data contained in blockchain transactions follows a tight power law distribution, and our design takes advantage of that.

A power law distribution is a pattern where a small number of items in a system account for most of the total, while the large majority of items each contribute very little. A familiar example is language: a small number of words make up most of what is spoken or written, while a great many words are rarely used.

In relation to blockchain transactions, a power law distribution means that a small number of contracts or wallets handle most of the transaction activity, while the majority of contracts and wallets have only a few transactions. This is a common pattern in many systems, where a small group of “super-users” or “whales” dominate the activity.
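As a rough illustration of how concentrated such a distribution is, the Python sketch below builds a synthetic Zipf-style distribution, where the contract at rank r gets a weight proportional to 1/r. The number of contracts and the exponent are assumptions chosen for the example, not measurements from any real chain.

```python
import math

def zipf_weights(n, exponent=1.0):
    """Normalized weights proportional to 1 / rank**exponent."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def top_share(weights, k):
    """Fraction of all activity handled by the k most active items."""
    return sum(sorted(weights, reverse=True)[:k])

def entropy_bits(weights):
    """Average number of bits an ideal code needs per item drawn."""
    return -sum(w * math.log2(w) for w in weights if w > 0)

n_contracts = 10_000  # hypothetical number of contracts
weights = zipf_weights(n_contracts)

print(f"top 1% of contracts handle {top_share(weights, 100):.0%} of calls")
print(f"entropy: {entropy_bits(weights):.1f} bits per call")
print(f"uniform: {math.log2(n_contracts):.1f} bits per call")
```

Under these assumptions the top 1% of contracts receive a little over half of all calls, and the entropy comes out below the uniform figure. That gap is what a compressor can exploit.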

In practice, the probability distribution of which account is making a transaction, which contract is being executed, or which arguments are passed to the method being called is sharply peaked. This means that a small minority of those accounts or contracts are far more likely than the rest to occur in the data to be sent. For example, if a particular contract is called by 10% of transactions, its address can be encoded in about 3.3 bits (the base-2 logarithm of 10). That is a 48x compression ratio on the uncompressed 20-byte (160-bit) address. So if a contract with the address “0x23924838398498492” is called frequently, it can be labeled with a small number of bits in the transaction.
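The arithmetic behind those figures can be checked in a few lines of Python. The sketch uses the standard information-theory result that an ideal code spends -log2(p) bits on a symbol with probability p; a real entropy coder only approaches this bound, and the function name is ours.

```python
import math

ADDRESS_BITS = 20 * 8  # an uncompressed 20-byte address is 160 bits

def ideal_code_length_bits(probability):
    """Bits an ideal entropy code spends on a symbol with this probability."""
    return -math.log2(probability)

bits = ideal_code_length_bits(0.10)  # contract called by 10% of transactions
ratio = ADDRESS_BITS / bits

print(f"{bits:.2f} bits, {ratio:.1f}x")  # prints "3.32 bits, 48.2x"
```

The same function shows how quickly the benefit falls off: an address that appears in only 0.1% of transactions would still need about 10 bits, which is a smaller but still substantial saving over 160.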

Signature Aggregation

The power law distribution does not apply to every aspect of a blockchain transaction. Hashes and signatures in particular need a different approach. Changing any single bit of a transaction changes its hash and signature completely, and their values are uniformly distributed. Since no two executed blockchain transactions are identical (nonces, which prevent replay attacks, guarantee this), hashes and signatures are effectively incompressible.

Hashes are straightforward: they can be recalculated based on the transaction data, so they don’t need to be sent. The receiving client can generate them.
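A minimal Python sketch of that idea follows. SHA-256 from the standard library stands in for whatever hash function the chain really uses, and the payload format and function names are hypothetical.

```python
import hashlib

def tx_hash(payload: bytes) -> str:
    # Assumption: SHA-256 is only a stand-in for the chain's real hash.
    return hashlib.sha256(payload).hexdigest()

def encode_for_wire(payload: bytes) -> bytes:
    """Sender side: transmit the transaction data only, with no hash attached."""
    return payload

def decode_from_wire(wire: bytes):
    """Receiver side: regenerate the hash locally from the received data."""
    return wire, tx_hash(wire)

payload = b'{"to":"0xabc","nonce":7}'  # made-up transaction data
sender_hash = tx_hash(payload)

received_payload, receiver_hash = decode_from_wire(encode_for_wire(payload))
print(receiver_hash == sender_hash)  # prints True
```

Because the hash is a deterministic function of the data, both sides arrive at the same value and the 32 bytes never need to cross the network.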

Signatures are more difficult. They must be sent with each transaction and can’t be compressed. However, we can use the BLS signature scheme to solve this. BLS signatures allow us to combine multiple signatures into a single aggregate signature whose size stays the same regardless of how many signatures went into it.

Somnia uses BLS signature aggregation for signature verification speed and, crucially, because it enables a far better compression ratio. With large batches, this reduces the cost of signature verification by many orders of magnitude.
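To give a feel for the size effect, the Python sketch below compares the bytes spent on signatures with and without aggregation. Both sizes are assumptions for illustration: 65 bytes for an individual ECDSA signature as used on Ethereum, and 96 bytes for a BLS12-381 aggregate signature. The article does not state which curve or encoding Somnia uses, and the sketch does no actual cryptography.

```python
ECDSA_SIG_BYTES = 65    # assumption: secp256k1 signature (r, s, v)
BLS_AGG_SIG_BYTES = 96  # assumption: one BLS12-381 aggregate signature

def individual_signature_bytes(n_transactions: int) -> int:
    """One signature travels with every transaction."""
    return n_transactions * ECDSA_SIG_BYTES

def aggregated_signature_bytes(n_transactions: int) -> int:
    """One constant-size aggregate covers the whole batch."""
    return BLS_AGG_SIG_BYTES if n_transactions > 0 else 0

for n in (1, 100, 10_000):
    individual = individual_signature_bytes(n)
    aggregated = aggregated_signature_bytes(n)
    print(f"{n:>6} txs: {individual:>7} bytes vs {aggregated} bytes")
```

For a single transaction the aggregate is actually larger under these assumptions, so the saving comes entirely from batching: the bigger the batch, the smaller the per-transaction signature overhead.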

Novel Compression

We have thought deeply about data optimization and compression in the design of Somnia, and this article only scratches the surface of the innovations we have leveraged to achieve speeds of over 400,000 transactions per second. If you want to dig deeper into the tech, read our updated litepaper and let us know what you think through our community channels.
