Feature

How big data and distributed systems solve traditional scalability problems

The highly centralized enterprise data center is becoming a thing of the past, as organizations must embrace a more distributed model to deal with everything from content management to big data. Here we examine how technologies like Hadoop and NoSQL fit into modern distributed architectures in a way that solves scalability and performance problems.

Jason Tee

Published: 06 Feb 2013

It's rare to see an enterprise that relies solely on centralized computing. Nevertheless, many organizations still keep a tight grip on their internal data center and avoid any more distribution than is absolutely necessary. Sometimes this is due to investments in existing infrastructure. At other times it is due to security concerns that arise from a risk-averse culture. However, centralization is becoming less and less feasible due to a number of unavoidable factors:

The number and variety of client devices increase year over year, creating an increasingly complex array of end points to serve
The amount and variety of data collected continue to expand exponentially with the use of social, mobile, and embedded technology
The need to mine this data for business insights becomes imperative in a competitive marketplace
The requirements of continuous development and deployment create a demand for systems that are highly componentized for greater agility and flexibility (SOA)
The cost of scaling internally to provide the computing resources to keep up with demand while maintaining acceptable performance levels becomes too high to handle from both an administrative and infrastructure standpoint
Having a potential single point of failure is unacceptable in an era when decisions are made in real time, loss of access to business data is a disaster, and end users don't tolerate "downtime"

So how does embracing a more distributed architecture address the aforementioned issues? Different aspects of the distributed computing paradigm resolve different types of performance issues. Here are just a few examples:

Peer pressure is a good thing

The peer-to-peer distributed computing model keeps applications and data available even in the event of partial system failure. Some vendor SLAs guarantee high availability with 99% uptime or higher, a feat that few enterprises can match using centralized computing. Automated failover mechanisms mean end users are often unaware that there is even a problem, since communication with the servers is not compromised. What about latency issues? SLAs may also be customized with specific performance metrics for response time and other factors that align with business objectives.
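To make the failover idea concrete, here is a minimal Python sketch. It is an illustration only, not any vendor's mechanism: the names `call_with_failover`, `failed_peer` and `healthy_peer` are hypothetical, and each peer is modeled as a plain function that either answers or raises `ConnectionError`. A small helper also converts an SLA uptime percentage into the downtime it still permits per year.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every replica in the list has failed."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    failures = 0
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError:
            failures += 1  # this peer is down; fall through to the next one
    raise AllReplicasFailed(f"all {failures} replicas failed")


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Convert an SLA uptime percentage into allowed downtime per year."""
    return (100.0 - uptime_percent) / 100.0 * 365 * 24


# Hypothetical stand-ins for two peer servers (assumptions for this sketch).
def failed_peer(request: str) -> str:
    raise ConnectionError("peer unreachable")


def healthy_peer(request: str) -> str:
    return f"handled {request}"


if __name__ == "__main__":
    # The caller never sees the first peer's failure.
    print(call_with_failover([failed_peer, healthy_peer], "GET /report"))
    print(downtime_hours_per_year(99.0))
```

Note that a 99% guarantee still allows roughly 87.6 hours of downtime a year, which is why the exact figure in an SLA matters.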

The sky is the limit

The "virtually" unlimited scalability of cloud computing provides the ability to increase or decrease usage of infrastructure resources on demand. Instant and automated provisioning and de-provisioning of servers and other resources allow enterprises to perform better by ensuring that end user access to applications keeps up with simultaneous, resource-intensive demand – even when traffic spikes unexpectedly.
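The provisioning decision behind this kind of elasticity can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a real cloud provider's API: the function name `desired_servers`, the per-server capacity figure, and the 20% headroom default are all hypothetical.

```python
def desired_servers(
    requests_per_second: int,
    capacity_per_server: int,
    min_servers: int = 1,
    max_servers: int = 100,
    headroom_percent: int = 20,
) -> int:
    """Return how many servers to run for the current load.

    Adds spare headroom on top of the measured load, rounds up to whole
    servers, then clamps the result to the allowed fleet size.
    """
    numerator = requests_per_second * (100 + headroom_percent)
    denominator = capacity_per_server * 100
    needed = -(-numerator // denominator)  # integer ceiling division
    return max(min_servers, min(max_servers, needed))


if __name__ == "__main__":
    # Normal traffic, then an unexpected spike, then a quiet period.
    for load in (1_000, 4_500, 50):
        print(load, desired_servers(load, capacity_per_server=100))
```

Running the same calculation on a schedule, and provisioning or de-provisioning to match the result, is the essence of scaling both up and down on demand.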

Data is a big deal

The use of distributed systems also has implications for "Big Data". The advent of NoSQL options provides an opportunity for enterprises to bifurcate their data stream to accept and fully utilize both relational data via SQL DBs and non-relational data with DB options such as MarkLogic and MongoDB. Arnon Rotem-Gal-Oz, Architecture Director at Nice Systems, points out that SQL still has the edge when it comes to reporting functionality, security and manageability. On the other hand, he admits, "If you have scale problems that are hard or expensive to solve with traditional technologies, NoSQL fills these needs in ways that you didn't have before."
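A bifurcated data stream can be pictured with a small Python sketch. It is only an illustration under loud assumptions: an in-memory SQLite table stands in for the SQL DB, a plain list of JSON strings stands in for a document store such as MongoDB or MarkLogic, and the `orders` schema and the `route` function are hypothetical.

```python
import json
import sqlite3

# Assumed fixed schema for the relational side of the stream.
ORDER_COLUMNS = ("id", "customer", "amount")


def is_relational(record: dict) -> bool:
    """A record fits the SQL table only if it has exactly the known columns."""
    return set(record) == set(ORDER_COLUMNS)


def route(records: list, sql_conn: sqlite3.Connection, document_store: list) -> tuple:
    """Send schema-conforming records to SQL and everything else to documents."""
    sql_count = doc_count = 0
    for record in records:
        if is_relational(record):
            sql_conn.execute(
                "INSERT INTO orders (id, customer, amount) VALUES (?, ?, ?)",
                tuple(record[column] for column in ORDER_COLUMNS),
            )
            sql_count += 1
        else:
            document_store.append(json.dumps(record))  # schemaless storage
            doc_count += 1
    sql_conn.commit()
    return sql_count, doc_count


def make_sql_store() -> sqlite3.Connection:
    """Create the in-memory stand-in for the relational database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
    return conn


if __name__ == "__main__":
    conn, docs = make_sql_store(), []
    stream = [
        {"id": 1, "customer": "acme", "amount": 19.5},
        {"user": "jdoe", "tweet": "great service", "tags": ["mobile", "social"]},
    ]
    print(route(stream, conn, docs))
```

The relational side keeps the reporting strengths of SQL, while irregular records are accepted without forcing them into a fixed schema.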

Implementing native applications that run on thick clients relieves servers of some of their workload and can deliver a faster and more user-friendly experience (assuming there isn't a need to update data frequently between the client and the server). Using a tiered structure that divides responsibilities among web, application and data servers lets organizations outsource whichever of these layers a third-party vendor can handle most effectively. This type of multi-tiered distributed computing can also be used to lessen the burden on internal servers even when deploying applications for thin clients such as mobile devices.

Bargain basement pricing

Large-scale distributed virtualization technology has reached the point where third-party data center and cloud providers can squeeze every last drop of processing power out of their CPUs to drive costs down further than ever before. Even an enterprise-class private cloud may reduce overall costs if it is implemented appropriately. The number of vendors in the cloud arena is still growing, leading to more and more competitive pricing arrangements. In addition to lowering costs, relieving internal IT personnel of the administrative burden may free up resources for developing applications that improve performance in other ways.

Versatility in technology choices

A distributed architecture is able to serve as an umbrella for many different systems. Hadoop is just one example of a framework that can bring together a broad array of tools such as (according to Apache.org):

Hadoop Distributed File System: provides high-throughput access to application data
Hadoop YARN: job scheduling and cluster resource management
Hadoop MapReduce: parallel processing of big data
Pig: a high-level data-flow language for parallel computation
ZooKeeper: a high-performance coordination service for distributed applications
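The parallel processing model behind MapReduce, one of the tools listed above, can be illustrated with a single-process Python sketch. This is not the Hadoop API: the function names are hypothetical, and everything runs in one process purely to show the map, shuffle and reduce phases that a cluster would spread across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one line of input."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big deal"]))
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, a real cluster can run many of them at once on different nodes, which is where the scalability comes from.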

This framework may be of special interest to enterprises since some very bright minds are working on commercialization projects at Yale University right now in concert with Hadapt. Dr. Daniel Abadi believes that "Hadoop is going to make it to the next level. We saw a lot of adoption in 2012. Now it's about trying to figure out the 'perfect' Hadoop use cases. So, building some vertical-specific applications is going to be a pretty big trend for 2013." Those use cases that increase distributed computing and business performance will be trailblazers to watch.

Are your architectures becoming more or less centralized? Let us know what you think, and follow TheServerSide on Twitter (@TSS_dotcom).
