Google sorted 1PB with MapReduce

Google announced sorting of 1PB data: “It took six hours and two minutes to sort 1PB (10 trillion 100-byte records) on 4,000 computers, writing it to 48,000 hard drives.” MapReduce is a key component of our software infrastructure that allows us to run multiple processes simultaneously. MapReduce is a perfect solution for many of the computations we run […]

Google announced sorting of 1PB data: “It took six hours and two minutes to sort 1PB (10 trillion 100-byte records) on 4,000 computers, writing it to 48,000 hard drives.”

MapReduce is a key component of our software infrastructure that allows us to run multiple processes simultaneously. MapReduce is a perfect solution for many of the computations we run daily, due in large part to its simplicity, applicability to a wide range of real-world computing tasks, and natural translation to highly scalable distributed implementations that harness the power of thousands of computers.

We are excited to announce we were able to sort 1TB stored on the Google File System as 10 billion 100-byte records in uncompressed text files on 1,000 computers in 68 seconds. By comparison, the previous 1TB sorting record is 209 seconds on 910 computers.

Sometimes you need to sort more than a terabyte, so we were curious to find out what happens when you sort more and gave one petabyte (PB) a try. One petabyte is a thousand terabytes, or, to put this amount in perspective, it is 12 times the amount of archived web data in the U.S. Library of Congress as of May 2008. In comparison, consider that the aggregate size of data processed by all instances of MapReduce at Google was on average 20PB per day in January 2008.

Full Article