Three things you didn't know about big data and Amazon Web Services (AWS)
By Jason Tee, Enterprise Software Architect
As data gets bigger, Amazon Web Services (AWS) is positioning itself to help enterprises turn the deluge of information into more business value at a lower cost. Hadoop, DynamoDB and Elastic MapReduce get a lot of play in discussions of how AWS is making it easier for companies to manage, store and analyze their data. But there are also some hidden gems buried a little deeper. Here are three that could help organizations do business better.
Big data comes free with the Amazon cloud
When enterprises think about big data, they tend to focus mostly on the information they are collecting directly from interactions with customers. More forward-thinking business people might also recognize the value that can be extracted from data collected by partner organizations. But what about all the big data that is already available out there, free of charge, for anyone who cares to mine it?
AWS offers a library of free datasets that boggles the mind. Much of the information is of limited business value. For example, few users likely have a (legal) reason to browse the Whole Genome Shotgun Sequencing of the Cannabis Sativa Cultivar "Chemdawg". Datasets with potential commercial use for a global enterprise include Japan’s economic census data and the facts available about millions of topics in the Freebase Data Dump. The information is typically a few years old, so it’s not exactly real-time. But if you know how to query it correctly, you could potentially save money on consulting research. The most valuable thing about the AWS library is that it gets you to consider where else you might look for free data.
AWS will school you on big data
Amazon Web Services is giving back to the developer and business community as a sponsor of Big Data University. So far, IBM is dominating this venue. However, Amazon is making a play to attract more students for its own courses using giveaways for AWS. This online university offers a variety of courses free of charge. The Hadoop tutorials seem to be very popular, garnering testimonials such as this one from newbies like 'Roman': "The course is excellent because it saves time from reading big books to learn Hadoop. I prefer agile practice: try to achieve small results ASAP. I didn't know anything about Hadoop two months ago. But these two months were enough for me to create 7 nodes Hadoop cluster..."
For those with a little more big data savvy, there's even the opportunity to write and post an online course. That could be a smart play for enterprises that have a good developer team working on their big data. Creating a tutorial is an excellent way to ensure knowledge retention if a valued team member moves on to another job. Posting the course online for others to take increases the pool of available replacement talent.
You can buy analyzing power for big data on the spot
Once you’ve provisioned the resources necessary to collect and store your data, you still have to spend more money to actually analyze it. Running any type of meaningful analytics takes computing power. For some types of data, it makes sense to perform analysis every hour, or even minute by minute. For other big data sets, analysis is required only occasionally. That might be daily, seasonally, after a big event (on a global scale or just one that shakes up your specific industry vertical), or when you acquire a large amount of new data. Fortunately, AWS lets you burst your business analytics with EC2 just as you would with any other "as a Service" offering.
How does this work? At any given time, AWS has some resources that aren’t being fully used. Like empty cabins on a cruise ship, this spare capacity can be purchased at a discounted rate, known as the spot price. You bid by setting a cap on the maximum rate you are willing to pay per hour, and your additional instances run in the cloud as long as no one outbids you. If you do get outbid, or the available resource pool runs out, your additional instances are terminated. This means spot instances aren’t the best option for mission-critical analytics. However, they can still be used as part of a larger elastic consumption model.
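The bidding rule is easier to see in a small simulation. This is only a sketch: the hourly prices are made up, `simulate_spot_run` is a hypothetical helper rather than anything AWS provides, and it assumes a simplified model in which "being outbid" means the going spot price for the hour has risen above your cap.

```python
def simulate_spot_run(hourly_prices, max_bid):
    """Return (hours_run, total_cost, interrupted) for one spot instance.

    Simplified model (an assumption, not AWS's actual billing rules): each
    hour the instance runs and is charged the going spot price, as long as
    that price is at or below max_bid. The first hour the price exceeds
    max_bid, the instance is terminated.
    """
    hours_run = 0
    total_cost = 0.0
    for price in hourly_prices:
        if price > max_bid:
            # Outbid: the instance is shut down before this hour runs.
            return hours_run, total_cost, True
        hours_run += 1
        total_cost += price
    return hours_run, total_cost, False


# Made-up hourly spot prices, in dollars per instance-hour.
prices = [0.03, 0.04, 0.05, 0.12, 0.04]
hours, cost, interrupted = simulate_spot_run(prices, max_bid=0.10)
print(hours, round(cost, 2), interrupted)
```

With these invented numbers, the instance runs for three hours and is terminated in the fourth, when the price jumps above the $0.10 cap. A higher cap buys more uninterrupted time at the risk of a higher bill.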
AWS has quite a few case studies of businesses that use this approach for regular big data tasks to save money. As an example, Foursquare uses spot instances to perform analytics across more than 3 million daily check-ins from users. The company knew it would be using a lot of resources for this task on an ongoing basis, but didn’t want to pay full price. So it chose a task that can be interrupted occasionally without causing a massive problem. AWS cautions users to ensure their applications have a high fault tolerance to get the most out of spot instances.
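One common way to build in that kind of fault tolerance is to checkpoint progress, so an interrupted job resumes where it stopped instead of starting over. The following is a minimal sketch under stated assumptions: the `process_batches` helper, the checkpoint file layout and the idea of splitting work into ordered batches are all hypothetical, and none of it is drawn from how Foursquare actually runs its analytics.

```python
import json
import os


def process_batches(batches, checkpoint_path, handle_batch):
    """Process batches in order, recording progress after each one.

    If the instance is terminated mid-run, calling this again with the same
    checkpoint_path skips the batches that were already completed.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_batch"]

    for index in range(start, len(batches)):
        handle_batch(batches[index])
        # Write to a temporary file and rename it, so a crash never leaves
        # a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_batch": index + 1}, f)
        os.replace(tmp_path, checkpoint_path)
```

For this to help on a spot instance, the checkpoint would need to be kept somewhere that survives termination of the instance. A batch that was in progress when the instance was terminated is run again on restart, so `handle_batch` should be safe to repeat.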
Let us know: what other special services does your company use for big data on AWS?
15 Jun 2013