Microsoft Research System Sorted 1,401 Gigabytes of Data in Just 60 Seconds


Microsoft Research broke the speed record at MinuteSort

In the world cup of data, Microsoft Research just broke the speed record. The MinuteSort benchmark is considered the "World Cup" of data sorting and is concerned with measuring how much data can be sorted in sixty seconds.

The MinuteSort benchmark is designed to measure "how quickly data can be sorted starting and ending on disk." In this case, "Using a new technique called Flat Datacenter Storage (FDS) a team from Microsoft Research (MSR) has just sorted almost three times the amount of data (using only one-sixth of the hardware resources) as the previous record holder, a team from Yahoo!"

"The team's system sorted almost three times the amount of data (1,401 gigabytes vs. 500 gigabytes) with about one-sixth the hardware resources (1,033 disks across 250 machines vs. 5,624 disks across 1,406 machines) used by the previous record holder, a team from Yahoo! that set the mark in 2009," Microsoft Research states.

Microsoft said the record is significant because it points toward a new method for crunching huge amounts of data using inexpensive servers. In an age when information is increasing in enormous quantities, the ability to move and deploy it is important for everything from web searches to business analytics to understanding climate change.

"In practice, heavy-duty sorting can be used by enterprises looking through huge data sets for a competitive advantage. The Internet also has made data sorting critical. Advertisements on Facebook pages, custom recommendations on Amazon, and up-to-the-second search results on Bing all result from sorting," MSR said.

FDS is the first general-purpose system to break the terabyte barrier, fulfilling the late Jim Gray's long-term vision from a 1994 paper.
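To put the quoted figures in perspective, a quick back-of-the-envelope calculation (1,401 GB in 60 seconds across 1,033 disks and 250 machines) gives the aggregate and per-disk throughput. The numbers come straight from the quotes above; note that the real workload reads and writes each byte at least once, so actual per-disk bandwidth is higher than this lower bound.

```python
# Back-of-the-envelope throughput from the quoted MinuteSort figures.
data_gb = 1401    # gigabytes sorted
seconds = 60      # MinuteSort time budget
disks = 1033      # disks used by the MSR team
machines = 250    # machines used by the MSR team

aggregate_gb_s = data_gb / seconds               # ~23.4 GB/s end to end
per_disk_mb_s = aggregate_gb_s * 1000 / disks    # ~22.6 MB/s per disk (lower bound)
per_machine_gb_s = aggregate_gb_s / machines     # ~0.09 GB/s per machine

print(f"aggregate:   {aggregate_gb_s:.1f} GB/s")
print(f"per disk:    {per_disk_mb_s:.1f} MB/s (lower bound; data is read and written)")
print(f"per machine: {per_machine_gb_s:.2f} GB/s")
```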

In a feature story on the Microsoft Research site, MSR researcher Jeremy Elson remarked on how this moves the game on from the current state-of-the-art MapReduce and Hadoop systems.

"Improving big-data performance has a wide range of implications across a huge number of businesses," Elson says. "Almost any big-data problem now becomes more efficient, which, in many cases, will be the difference between the work being economically feasible or not."

The post provides a lot more detail for the data geeks, but the key for me is what this means in the emerging world of Big Data.