The Google File System [GFS]
The Google File System is a scalable distributed file system for large, distributed, data-intensive applications. Like MapReduce, it is designed to run on inexpensive commodity hardware.
The largest cluster at that time (2003) had over 1,000 storage nodes and over 300 TB of disk storage, and was heavily accessed by hundreds of clients on distinct machines.
The main problem GFS focuses on is meeting the rapidly growing demands of Google’s data processing needs. Building on inexpensive commodity hardware keeps the system cheap and easy to scale.
At the same time, it does not compromise on the qualities expected of earlier distributed file systems: performance, scalability, reliability, and availability.
Their findings
Workloads consist mainly of two kinds of reads: large streaming (sequential) reads and small random reads.
Performance-conscious applications often batch and sort their small reads so they advance steadily through the file rather than going back and forth (see the first sketch after this list).
Writes follow a similar pattern to the reads above: mostly large, sequential writes that append data to files, with small writes at arbitrary positions being rare.
Files are often used as producer-consumer queues or for many-way merging (a k-way merge, e.g., via a tournament tree; see the second sketch after this list).
High sustained bandwidth is preferred over low latency.
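
To illustrate the read-batching point above, here is a minimal sketch (not from the paper; the file, offsets, and `batched_reads` helper are illustrative assumptions) of how a client might sort its pending small reads by offset so a pass over the file moves strictly forward instead of seeking back and forth:

```python
import os
import tempfile

def batched_reads(path, requests):
    """requests: list of (offset, length) pairs; returns {offset: bytes}."""
    results = {}
    with open(path, "rb") as f:
        # Sorting by offset turns many back-and-forth seeks into one forward pass.
        for offset, length in sorted(requests):
            f.seek(offset)
            results[offset] = f.read(length)
    return results

if __name__ == "__main__":
    # Build a small throwaway file so the example is self-contained.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(256)) * 32)  # 8 KiB of sample data
        path = tmp.name
    try:
        # Three small reads issued out of order; the helper sorts them by offset.
        print(batched_reads(path, [(4096, 8), (0, 8), (1024, 8)]))
    finally:
        os.remove(path)
```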
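
The many-way merge mentioned above is the classic k-way merge: each sorted input exposes its current front element to a small priority structure (a tournament tree in the paper's framing; a min-heap plays the same role here), and the overall minimum is popped repeatedly. A minimal Python sketch under those assumptions, not tied to GFS itself:

```python
import heapq

def k_way_merge(sorted_iterables):
    """Merge any number of sorted inputs into one sorted stream.

    A min-heap stands in for the tournament tree: it always exposes the
    smallest of the k current front elements, so each output costs O(log k).
    """
    heap = []
    iters = [iter(it) for it in sorted_iterables]
    # Seed the heap with the first element of each non-empty input.
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, i))
    while heap:
        value, i = heapq.heappop(heap)
        yield value
        nxt = next(iters[i], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))

if __name__ == "__main__":
    runs = [[1, 4, 9], [2, 3, 10], [0, 7]]
    print(list(k_way_merge(runs)))  # [0, 1, 2, 3, 4, 7, 9, 10]
```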