TheServerSide had an interesting conversation with Samantha Kosko a little while ago. She's one of the field marketing folks working for 10gen, the people behind MongoDB, so she's not completely unbiased when it comes to the world of data persistence, but her insights into what was going on over at Craigslist with regard to their big-data problem were interesting, to say the least.
Craigslist handles a boatload of postings every day, but as big as that load is, their daily incoming traffic is not enough to sink a relational vessel. However, managing their archive of postings was another story. "Historically, Craigslist had been a solid MySQL shop, but as they grew larger, their schema design and flexibility for data warehousing wasn't working out well," says Samantha.
The problem with persistence
At Craigslist, there was never any real problem using MySQL to handle incoming data and the thousands of postings that arrive every hour. The problem was the warehousing. I thought that, for privacy reasons, companies like Craigslist had to purge all of their data after a certain amount of time, but apparently the opposite is true: there's some legislation somewhere that says they've got to archive every piece of data they've ever received. When you're the largest online classified ad and job posting site in the world, archiving all of that data quickly becomes a big-data problem of unparalleled proportions.
So, what were the big-data problems Craigslist was encountering?
One major problem Craigslist encountered was the damage they did to the backend archive whenever they needed to change how data was persisted on the front end. It's one thing to change the schema for the databases handling the very manageable amount of data that has come in within the past 60 days. But a schema change on the front end then means the same change has to be propagated to all of the databases maintaining the archived data as well, which means updating a cluster of MySQL servers holding a billion or so records. That's not only dangerous, but incredibly time-consuming as well.
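To make that coupling concrete, here is a minimal sketch of how a front-end schema change can stall a relational archive. It is not Craigslist's actual setup: SQLite (from Python's standard library) stands in for MySQL, and the table and column names are invented for illustration.

```python
import sqlite3

# SQLite stands in for MySQL here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE live_postings (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE archive_postings (id INTEGER PRIMARY KEY, title TEXT)")

# The front end changes: live postings now carry a price.
conn.execute("ALTER TABLE live_postings ADD COLUMN price INTEGER")
conn.execute("INSERT INTO live_postings VALUES (1, 'Used couch', 50)")

# Copying into the unchanged archive table now fails on the column mismatch.
copy_failed = False
try:
    conn.execute("INSERT INTO archive_postings SELECT * FROM live_postings")
except sqlite3.Error:
    copy_failed = True

# Archiving only resumes once the archive table receives the same change.
conn.execute("ALTER TABLE archive_postings ADD COLUMN price INTEGER")
conn.execute("INSERT INTO archive_postings SELECT * FROM live_postings")
archived = conn.execute(
    "SELECT id, title, price FROM archive_postings"
).fetchall()
```

On a toy table the second `ALTER TABLE` is instant. On an archive holding a billion or so records, that same step is the slow and risky one.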
Dealing with performance problems
This then dominoes into a performance problem, because while the archive is being altered, nothing is being archived, which means the live system is being pushed beyond its intended capacity. It's a lose-lose situation, as the backend gets tripped up with a configuration change, while users accessing the website experience unusual latency and a lack of responsiveness.
"They looked at a few different NoSQL solutions, one of which was MongoDB, which they selected at the beginning of 2011," says Samantha. "They decided to switch their content management over to MongoDB. This took about three months, and it transferred approximately 1.5 billion postings."
The MySQL and MongoDB solution
The neat thing about Craigslist, though, is that they didn't go completely hog-wild over their NoSQL solution. MySQL is still the workhorse on the front end, and so it should be. User-centric data should never be schemaless, and correspondingly, the data being pulled in from users continues to populate a cluster of MySQL servers.
"How it works is MySQL is still the active database for all of the online properties and postings for Craigslist. But once a posting goes dead, MongoDB reads into the MySQL and writes that posting into JSON-like documents, which is how MongoDB stores its information. By doing that, Craigslist was able to provide a schema-less design which allowed them the flexibility to archive for multiple years of files without having to worry about failure or future flexibility and designs."
It's actually an interesting use case, not only because it shows how NoSQL solutions are being applied in the real world, but also because it demonstrates an unusual role for a NoSQL store. Data stores like MongoDB are often used when read access and speed are a priority. Writes to NoSQL stores can often be slower or even less reliable, as consistency (C) is less of a priority than availability (A) and partition tolerance (P), as we all learned when we studied the CAP theorem. But in this case it would appear that read performance is a trivial concern. After all, who's going to be in a huge hurry to access a dead post that went into the party rentals section of Craigslist eight years ago?
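The hand-off Samantha describes, where a dead posting leaves the relational store and lands in the archive as a JSON-like document, can be sketched roughly as follows. This is an illustration under stated assumptions, not Craigslist's code: SQLite stands in for MySQL, a plain Python list of dicts stands in for a MongoDB collection, and the `postings` table, its columns, and the `archive_dead_postings` helper are all hypothetical.

```python
import sqlite3


def archive_dead_postings(conn, archive):
    """Move dead postings out of the relational table into the archive.

    `archive` is a plain list of dicts standing in for a document
    collection. Returns the number of postings moved.
    """
    rows = conn.execute(
        "SELECT * FROM postings WHERE status = 'dead' ORDER BY id"
    ).fetchall()
    for row in rows:
        # Keep only the fields this posting actually has, so documents
        # written under different schema versions can sit side by side.
        doc = {key: row[key] for key in row.keys() if row[key] is not None}
        archive.append(doc)
        conn.execute("DELETE FROM postings WHERE id = ?", (row["id"],))
    conn.commit()
    return len(rows)


# Hypothetical live table: SQLite in memory, standing in for MySQL.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE postings ("
    "id INTEGER PRIMARY KEY, title TEXT, price INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO postings VALUES (?, ?, ?, ?)",
    [
        (1, "Bounce house rental", None, "dead"),
        (2, "Used couch", 50, "live"),
        (3, "Folding tables", 20, "dead"),
    ],
)

archive = []
moved = archive_dead_postings(conn, archive)
remaining = conn.execute("SELECT COUNT(*) FROM postings").fetchone()[0]
```

The first archived document has no `price` field at all, while the second one does. Neither shape requires touching the documents already in the archive, which is the flexibility the quote is getting at.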
In the end, the way Craigslist tackled their big-data dilemma by combining both relational and schema-less NoSQL solutions is a testament to how people in the industry are innovatively and creatively dealing with the problems that a constant influx of content and information presents.