Data-driven applications must be optimized for the edge

The expansion of technologies like 5G and edge computing has dramatically increased what’s possible for businesses and government organizations. Developers need to consider these trends when developing new applications to ensure that data and processes remain safe and secure.

As business data is increasingly produced and consumed outside of traditional cloud and data center boundaries, organizations need to rethink how their data is handled across a distributed footprint that includes multiple hybrid and multicloud environments and edge locations.

Business is becoming increasingly decentralized. Data is now produced, processed, and consumed around the world – from remote point-of-sale systems and smartphones to connected vehicles and factory floors. This trend, along with the rise of the Internet of Things (IoT), a steady increase in the computing power of edge devices, and better network connectivity, is spurring the rise of the edge computing paradigm.

IDC predicts that by 2023 more than 50% of new IT infrastructure will be deployed at the edge. And Gartner has projected that by 2025, 75% of enterprise data will be processed outside of a traditional data center or cloud.

Processing data closer to where it is produced and possibly consumed offers obvious benefits, like saving network costs and reducing latency to deliver a seamless experience. But, if not effectively deployed, edge computing can also create trouble spots, such as unforeseen downtime, an inability to scale quickly enough to meet demand and vulnerabilities that cyberattacks exploit.

Stateful edge applications that capture, store and use data require a new data architecture that accounts for the availability, scalability, latency and security needs of the applications. Organizations operating a geographically distributed infrastructure footprint at the core and the edge need to be aware of several important data design principles, as well as how they can address the issues that are likely to arise.

Map out the data lifecycle

Data-driven organizations need to start by understanding the story of their data: where it’s produced, what needs to be done with it and where it’s eventually consumed. Is the data produced at the edge or in an application running in the cloud? Does the data need to be stored for the long term, or stored and forwarded quickly? Do you need to run heavyweight analytics on the data to train machine learning (ML) models, or run quick real-time processing on it?

Think about data flows and data stores first. Edge locations have less computing power than the cloud, and so may not be ideally suited for long-running analytics and AI/ML. At the same time, moving data from multiple edge locations to the cloud for processing results in higher latency and network costs.

Very often, data is replicated between the cloud and edge locations, or between different edge locations. Common deployment topologies include:

  • Hub and spoke, where data is generated and stored at the edges, with a central cloud cluster aggregating data from there. This is common in retail settings and IoT use cases.
  • Configuration, where data is stored in the cloud, and read replicas are produced at one or more edge locations. Configuration settings for devices are common examples.
  • Edge-to-edge, a very common pattern, where data is either synchronously or asynchronously replicated or partitioned within a tier. Vehicles moving between edge locations, roaming mobile users, and users moving between countries and making financial transactions are typical of this pattern.
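
As a rough illustration, the three topologies above can be thought of as different sets of directed replication links between sites. The sketch below is a minimal model under that assumption; the names `Topology`, `Link` and `replication_links` are hypothetical and do not correspond to any particular database's API.

```python
from dataclasses import dataclass
from enum import Enum


class Topology(Enum):
    HUB_AND_SPOKE = "hub_and_spoke"  # edges produce data, the cloud aggregates it
    CONFIGURATION = "configuration"  # the cloud holds data, edges hold read replicas
    EDGE_TO_EDGE = "edge_to_edge"    # sites in the same tier replicate to each other


@dataclass(frozen=True)
class Link:
    """A directed replication link from one site to another (hypothetical type)."""
    source: str
    target: str


def replication_links(topology, cloud, edges):
    """Return the directed replication links implied by a topology."""
    if topology is Topology.HUB_AND_SPOKE:
        # Every edge site ships its data to the central cloud cluster.
        return [Link(edge, cloud) for edge in edges]
    if topology is Topology.CONFIGURATION:
        # The cloud pushes data out to a read replica at every edge site.
        return [Link(cloud, edge) for edge in edges]
    if topology is Topology.EDGE_TO_EDGE:
        # Every edge site replicates to every other edge site in the tier.
        return [Link(a, b) for a in edges for b in edges if a != b]
    raise ValueError(f"unknown topology: {topology}")
```

Calling `replication_links(Topology.HUB_AND_SPOKE, "cloud", ["store-1", "store-2"])` yields one link per store, each pointing at the cloud. A real deployment would also have to decide whether each link is synchronous or asynchronous, which this sketch leaves out.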

Knowing beforehand what needs to be done with collected data allows organizations to deploy optimal data infrastructure as a foundation for stateful applications. It’s also important to choose a database that offers flexible built-in data replication capabilities that facilitate these topologies.

Identify application workloads

Hand in hand with the data lifecycle, it is important to look at the landscape of application workloads that produce, process, or consume data. Workloads presented by stateful applications vary in terms of their throughput, responsiveness, scale and data aggregation requirements. For example, a service that analyzes transaction data from all of a retailer’s store locations will require that data be aggregated from the individual stores to the cloud.

These workloads can be classified into seven types.

  • Streaming data, such as data from devices and users, plus vehicle telemetry, location data, and other “things” in the IoT. Streaming data requires high throughput and fast querying, and may need to be sanitized before use.
  • Analytics over streaming data, such as when real-time analytics is applied to streaming data to generate alerts. These analytics should be supported either natively by the database or through an engine such as Spark or Presto.
  • Event data, including events computed on raw streams and stored in the database with atomicity, consistency, isolation and durability (ACID) guarantees that protect the data’s validity.
  • Smaller data sets with heavy read-only queries, including configuration and metadata workloads that are infrequently modified but need to be read very quickly.
  • Transactional, relational workloads, such as those involving identity, access control, security and privacy.
  • Full-fledged data analytics, when certain applications need to analyze data in aggregate across different locations (such as the retail example above).
  • Workloads needing long term data retention, including those used for historical comparisons or for use in audit and compliance reports.

Account for latency and throughput needs

Low latency and high throughput data handling are often high priorities for applications at the edge. An organization’s data architecture at the edge needs to take into account factors such as how much data needs to be processed, whether it arrives as distinct data points or in bursts of activity and how quickly the data needs to be available to users and applications.

For example, telemetry from connected vehicles, credit card fraud detection, and other real-time applications shouldn’t suffer the latency of being sent back to a cloud for analysis. They require real-time analytics to be applied right at the edge. Databases deployed at the edge need to be able to deliver low latency and/or high data throughput.
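
One way to make this trade-off concrete is to compare a latency budget against the estimated cost of processing at each site. The sketch below is only an assumption-laden illustration: the function name, its parameters and the millisecond figures in the usage note are hypothetical, and real estimates would come from measurement.

```python
def choose_processing_site(latency_budget_ms, edge_compute_ms,
                           cloud_round_trip_ms, cloud_compute_ms):
    """Pick where to process a request, preferring the edge when it fits.

    All inputs are estimated times in milliseconds. The cloud path pays
    for the network round trip in addition to its compute time.
    """
    cloud_total_ms = cloud_round_trip_ms + cloud_compute_ms
    if edge_compute_ms <= latency_budget_ms:
        return "edge"      # local processing meets the budget, no network hop
    if cloud_total_ms <= latency_budget_ms:
        return "cloud"     # edge is too slow, but the cloud still fits the budget
    return "neither"       # the budget cannot be met; the design needs rethinking
```

With a 50 ms budget, 20 ms of edge compute and an 80 ms round trip to the cloud, the function returns "edge". A heavyweight job that takes 500 ms at the edge but 5 ms in the cloud would go to the cloud whenever the budget allows for the 85 ms total.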

Prepare for network partitions

The likelihood of infrastructure outages and network partitions goes up as you go from the cloud to the edge. So when designing an edge architecture, you should consider how ready your applications and databases are to handle network partitions. A network partition is a situation where your infrastructure footprint splits into two or more islands that cannot talk to each other. Partitions can occur in three basic operating modes between the cloud and the edge.

Mostly connected environments allow applications to connect to remote locations to perform an API call most – though not all – of the time. Partitions in this scenario can last from a few seconds to several hours.

When networks are semi-connected, extended partitions can last for hours, requiring applications to be able to identify changes that occur during the partition and synchronize their state with the remote applications once the partition heals.
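
A common way to handle this is to buffer local changes during the partition and replay them in order once the link returns. The following is a minimal sketch of that idea, assuming a hypothetical `send` callable that raises `ConnectionError` while the remote site is unreachable. It deliberately ignores conflict resolution and durable storage of the buffer, both of which a production system would need.

```python
from collections import deque


class EdgeOutbox:
    """Buffers local changes while the remote site is unreachable."""

    def __init__(self, send):
        # `send` is a hypothetical callable that delivers one change to the
        # remote site and raises ConnectionError during a partition.
        self._send = send
        self._pending = deque()

    def record(self, change):
        """Queue a local change and try to deliver everything pending."""
        self._pending.append(change)
        self.flush()

    def flush(self):
        """Deliver pending changes in order, stopping at the first failure.

        Returns the number of changes delivered by this call.
        """
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # still partitioned; keep the change for a later retry
            self._pending.popleft()
            delivered += 1
        return delivered

    @property
    def pending(self):
        """Number of changes still waiting to be synchronized."""
        return len(self._pending)
```

An application would call `record` for every local write and call `flush` periodically or when connectivity is detected, so that the remote site catches up once the partition heals.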

In a disconnected environment, which is the most common operating mode at the edge, applications run independently. On rare occasions, they may connect to a server, but the vast majority of the time they don’t rely on an external site.

As a rule, applications and databases at the far edge should be ready to operate in disconnected or semi-connected modes. Near-edge applications should be designed for semi-connected or mostly connected operations. The cloud itself operates in mostly connected mode, which is necessary for cloud operations, but is also why a public cloud outage can have such a far-reaching and long-lasting impact.

Ensure software stack agility

Businesses use suites of applications, and should emphasize agility and the ability to design for rapid iteration of applications. Frameworks that enhance developer productivity, such as Spring and GraphQL, support agile design, as do open-source databases like PostgreSQL and YugabyteDB.

Prioritize security

Computing at the edge will inherently expand the attack surface, just as moving operations into the cloud does.

It’s essential that organizations adopt security strategies based on identities rather than old-school perimeter protections. Implementing least-privilege policies, a zero-trust architecture and zero-touch provisioning is critical for an organization’s services and network components.

You also need to seriously consider encryption, both in transit and at rest, multi-tenancy support at the database layer, and encryption for each tenant. Adding regional locality of data can ensure compliance and allow for any required geographic access controls to be easily applied.
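
To illustrate how tenant isolation and regional locality might combine into a single access check, here is a small deny-by-default sketch. The `Record` type, the `ALLOWED_READ_REGIONS` policy table and the region names are all hypothetical. In practice such rules would more likely be enforced by the database or an identity layer than by application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    tenant: str       # tenant that owns the record
    home_region: str  # region where the record must reside
    payload: str


# Hypothetical policy: for each home region, the regions allowed to read its data.
ALLOWED_READ_REGIONS = {
    "eu": {"eu"},
    "us": {"us", "eu"},
}


def can_read(record, requester_tenant, requester_region):
    """Allow a read only for the owning tenant from a permitted region."""
    if requester_tenant != record.tenant:
        return False  # tenants never see each other's data
    # Regions missing from the policy table are denied by default.
    allowed = ALLOWED_READ_REGIONS.get(record.home_region, set())
    return requester_region in allowed
```

Denying anything not explicitly listed keeps the check in line with a least-privilege policy: adding a new region grants no access until the policy table says so.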

The edge is increasingly where computing and transactions happen. Designing data applications that optimize speed, functionality, scalability and security will allow organizations to get the most from that computing environment.

Karthik Ranganathan is founder and CTO of Yugabyte.


This article was written by Karthik Ranganathan and Yugabyte from VentureBeat and was legally licensed through the Industry Dive Content Marketplace.