Distributed data – how to pick the right approach
Distributed data means different things to different people. For enterprise IT professionals, it means databases and data pipelines. For those in the world of Web3, it means decentralised data built on blockchain technology. Both approaches can be valuable. But are they the right ones to solve the business problem that you have? Here are some key points to consider.
Not all distributed data is created equal
A common theme here is how you think about distributing your data as it grows. Over time, a data set may become so large and so business-critical that you have to spread it across multiple systems. Distribution lets you continue to scale that data set up, and it also protects against the data being deleted, corrupted or otherwise lost.
This does present some problems. How should you approach distribution, and what trade-offs are involved? And are the issues that you think are most important really the problems that you need to solve?
Traditional database theory for distributed data follows the CAP theorem, formulated by Eric Brewer: a distributed system can guarantee at most two of consistency, availability and partition tolerance. In practice, network partitions cannot be designed away, so the real choice is what happens when one occurs. A system can preserve consistency by refusing some operations until the partition heals (CP), or it can stay available and accept that replicas will only become consistent eventually (AP). For databases running across multiple locations, thinking in CAP terms can help you pick the right platform for your application and workload.
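To make the trade-off concrete, here is a deliberately simplified sketch (toy code, not a real database): two replicas of a key-value store with a simulated network partition. A CP-style store refuses writes it cannot replicate, while an AP-style store accepts them locally and lets the replicas diverge until the partition heals. The class and mode names are illustrative inventions for this example.

```python
class Replica:
    """One copy of the data set, held at a single location."""
    def __init__(self):
        self.data = {}

class ReplicatedStore:
    """A two-replica store that behaves as CP or AP during a partition."""
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.partitioned = False  # simulated network fault between replicas
        self.a = Replica()
        self.b = Replica()

    def write(self, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: refuse the write (sacrifice availability).
                raise RuntimeError("write rejected: cannot reach replica B")
            # Availability first: accept locally, replicas now diverge.
            self.a.data[key] = value
            return
        # Healthy network: replicate synchronously to both nodes.
        self.a.data[key] = value
        self.b.data[key] = value

cp = ReplicatedStore("CP")
cp.write("user", "alice")     # replicated to both nodes
cp.partitioned = True
try:
    cp.write("user", "bob")
except RuntimeError:
    pass                      # CP: write refused, replicas still agree

ap = ReplicatedStore("AP")
ap.partitioned = True
ap.write("user", "bob")       # AP: accepted, but only replica A has it
```

Real systems sit on a spectrum between these two extremes, with quorum reads and writes, conflict resolution and repair mechanisms, but the underlying choice during a partition is the same.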
This has become more important as companies look at using cloud services to host data workloads. For instance, you can choose to run your database across a cloud provider’s locations, such as AWS Availability Zones in different locations. Alternatively, you could run your database across your own data centre and a cloud provider, or across multiple cloud providers at the same time. Whichever approach you pick, your data is physically distributed across multiple locations to guard against loss or failure.
For Web3 developers, CAP is less relevant. The challenge they want to solve is how to trust transactions in an environment that is distributed with no central point of control or authority. Blockchains achieve this through consensus: every participating node holds a copy of the chain and can compare its records against the others. Because that consensus is shared by all nodes, developers can build a data set that is publicly available and can be trusted without relying on any single party.
Both approaches start from an initial premise – how to think about distributed data. But by taking a step back and looking at the business problem that you want to solve, you can pick the right approach to distributed data, or even the right combination of distributed data technologies that you need.
What can different developers learn from each other?
For enterprise developers, blockchains may seem inseparable from cryptocurrency. However, they have more uses than virtual currencies. One of the most important premises behind blockchain deployments is the need for an indelible record of transactions over time. In practice, this can be applied to trace items through a supply chain and prove that a package is what it says it is.
A good example of this would be medicines or other sensitive consumables. For companies that face problems around fake goods and fraud, being able to track each product through the supply chain without relying on any one provider to manage and update that tracking data is invaluable. However, blockchains cannot scale to the transaction performance that traditional databases deliver. This comes down to how they work: each transaction must be confirmed through a consensus mechanism, such as proof of stake, across the whole network, compared with a single transaction against a database.
For Web3 developers, distributed data is normally equated with blockchain. However, just because data can be stored in a blockchain does not mean it is the right approach for every part of an application. A public ledger suits data that needs public consensus, but blockchain is poorly suited to analytics or real-time applications. Distributed databases are the better choice when you must carry out real-time transactions or want analytics results quickly.
Implementing analytics on this kind of data can be achieved by taking a copy of the blockchain and moving it into a format better suited to analysis, such as a database. The challenge is getting that connection working so that any new transactions on the blockchain are automatically streamed through to update the analytics data set as well, and doing so without complex indexing or extract-transform-load (ETL) operations.
Think business, not technology
For developers working in a particular area, it can be tempting to try to apply your favourite technologies to every problem. However, this is rarely the most efficient or effective way to solve those problems over time. By looking dispassionately at the issues you are trying to address and drawing on the whole array of data techniques available, you are more likely to succeed.
Selecting the right approach means matching each distributed data technology to the problems it suits best: distributed databases for workloads that process large volumes of transactions or carry out complex analytics in real time, and distributed ledgers for tasks where data provenance and trustworthiness are needed.