Monday, September 19, 2011

Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services

The idea behind CAP is simple and the proof is straightforward to understand. I think the important insight, given by Brewer himself, is that BASE and ACID form a spectrum, and different services fall at different points along it. For example, money transactions ideally should be ACID, while most other Internet-based services can tolerate temporary inconsistency with clever designs. It would be nice to design an infrastructure that lets developers tune the amount of inconsistency they can tolerate with parameters.
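
As a toy sketch of such tunable consistency, in the spirit of Dynamo-style quorum systems (the N/R/W parameters are my shorthand, not something from the paper):

# Toy sketch of Dynamo-style tunable consistency (not from the paper).
# N = number of replicas, R = replicas read, W = replicas written.
# If R + W > N, every read quorum overlaps every write quorum, so reads
# see the latest write (closer to ACID); smaller R and W trade consistency
# for availability and latency (closer to BASE).

def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """Quorum overlap condition: read and write quorums must intersect."""
    return r + w > n

# A money-transfer service might pick N=3, R=2, W=2 (strong), while a
# shopping cart might pick R=1, W=1 (eventual consistency).
print(is_strongly_consistent(3, 2, 2))  # True  -> reads see the latest write
print(is_strongly_consistent(3, 1, 1))  # False -> temporary inconsistency possible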


Cluster-Based Scalable Network Services

This paper gives an overview of one architecture for cluster-based network services. The paper is dated, but many of its ideas have persisted to this day:
1. Data semantics (BASE vs. ACID): many network services are willing to tolerate temporary inconsistency in exchange for higher availability, and it's fine for most of them to give an approximate answer to some queries.
2. Scalability: replicate components, or prove that non-replicable components are not bottlenecks.
3. Soft state: this is yet another way to improve availability. Soft state doesn't persist in the data store; it is recomputed from peer communication. This is still popular today in systems like Amazon Dynamo.
4. Layered architecture that helps developers focus on the "content" of network services: TACC (transformation, aggregation, caching, and customization) is essentially a precursor of Google's MapReduce framework (see the sketch after this list), and SNS is essentially Google's cluster software infrastructure, like GFS and Chubby.
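
As a toy sketch of the TACC-style transformation + aggregation pattern, here is a minimal word count in the MapReduce spirit (my own illustration, not code from the paper):

# Minimal word count in the transformation (map) + aggregation (reduce)
# style shared by TACC and MapReduce. Toy illustration, not from the paper.
from collections import Counter
from itertools import chain

def transform(doc: str):
    """Map step: turn one document into (word, count) pairs."""
    return [(word.lower(), 1) for word in doc.split()]

def aggregate(pairs):
    """Reduce step: sum the counts per word across all documents."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

docs = ["the web is big", "the web is distributed"]
print(aggregate(chain.from_iterable(transform(d) for d in docs)))
# Counter({'the': 2, 'web': 2, 'is': 2, 'big': 1, 'distributed': 1})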

Wednesday, August 31, 2011

An Intro to the Design of Warehouse Scale Machines, Part 2

Chapter 3
A WSM has three main hardware components: servers, the network fabric, and the storage hierarchy. The third chapter focuses on how choices about server hardware are made.
1. Benchmarks comparing low-end hardware against high-end shared-memory systems favor low-end for cost efficiency (see the sketch after this list).
2. Hardware that is too low-end increases latency, adds software development overhead, and slows down some parallel algorithms.
3. A balanced design across components needs to be considered at the WSM level.
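
A back-of-the-envelope sketch of the trade-off in points 1 and 2 (all prices, performance numbers, and node counts are invented for illustration):

# Hypothetical cost/performance comparison (numbers invented for illustration).
# Low-end nodes are individually slower, but per dollar they win as long as
# the workload parallelizes well; Amdahl's law caps the achievable speedup.

def speedup(nodes: int, serial_fraction: float) -> float:
    """Amdahl's law: speedup on `nodes` given a non-parallelizable fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)

budget = 100_000                              # dollars
low_price, low_perf = 2_000, 1.0              # perf in arbitrary units per node
high_price, high_perf = 50_000, 10.0

for serial in (0.0, 0.05, 0.5):
    low = low_perf * speedup(budget // low_price, serial)     # 50 cheap nodes
    high = high_perf * speedup(budget // high_price, serial)  # 2 big nodes
    print(f"serial={serial:.2f}  low-end cluster={low:6.1f}  high-end={high:6.1f}")
# With serial=0.00 the 50 cheap nodes win easily; as the serial fraction
# grows (point 2), their advantage shrinks and then reverses.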

Chapter 4 Datacenter basics
This chapter covers datacenter cooling and power systems.

Chapter 7 Dealing with Failures and Repairs
Failures are frequent at WSM scale: even if each server fails only once a decade, a cluster of ten thousand servers averages a few failures every day.
1. Relax hardware self-correction: handle hardware failures in software (see the sketch after this list).
2. Don't relax hardware error detection: checking for correct execution is too heavy a burden to place on software.
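
A minimal sketch of that division of labor, with hypothetical helper names: the lower layers only detect corruption (cheap checksums), and software recovers by falling back to another replica.

# Sketch: hardware/storage layers *detect* corruption via checksums; software
# *handles* it by retrying another replica. Names and data are hypothetical.
import hashlib

def read_with_recovery(replicas, expected_digest: str) -> bytes:
    """Try each replica; return the first block whose checksum verifies."""
    for fetch in replicas:
        block = fetch()
        if hashlib.sha256(block).hexdigest() == expected_digest:
            return block          # detection passed; no recovery needed
        # corruption detected -> software recovers by trying the next replica
    raise IOError("all replicas corrupt")

good = b"account balance: 42"
digest = hashlib.sha256(good).hexdigest()
replicas = [lambda: b"account balance: 99",   # corrupted copy
            lambda: good]                     # healthy copy
print(read_with_recovery(replicas, digest))   # b'account balance: 42'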
Classifying cluster-level failures:
1. by severity
2. by cause: configuration > software > hardware
Machine-level failures:
1. causes
2. prediction
3. repair


Tuesday, August 30, 2011

A Berkeley View of Cloud Computing

This paper describes Berkeley's understanding of cloud computing. It defines what cloud computing is (SaaS plus the software and hardware infrastructure from cloud providers), what the current spectrum of cloud computing looks like (AWS, App Engine, Azure), why it is more likely to succeed this time (economies of scale for cloud providers; elasticity and minimal upfront cost for cloud users; more applications, like mobile interactive apps and compute-intensive apps, for SaaS users), and the current obstacles and opportunities.

An Intro to the design of Warehouse Scale Machines Review

Large-scale Internet services drive the need for warehouse-scale machines (WSMs). More and more services have migrated to the Internet, especially after the release of Chrome OS and the Chrome Web Store. In the near future, industry will really need experts in the field of WSM design and implementation.

In this paper, Google shares some of the insights gained while fine-tuning their own WSMs. A WSM differs from a traditional datacenter in that its servers are relatively homogeneous, are maintained by one group, and function more like a single machine. WSMs present design challenges due to their sheer scale, high rate of hardware failures, and cost/performance trade-offs. The software infrastructure and frameworks running on a WSM are also difficult to write, because they need to address the different needs of applications (online vs. batch processing), variation within the system (different access rates within a rack versus across racks), and failure detection and recovery.
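
As a toy illustration of that intra-rack vs. cross-rack variation, here is a replica picker that prefers the closest copy (the latency numbers and helper names are illustrative placeholders, not figures from the paper):

# Toy replica selection: prefer the copy with the lowest expected latency.
# The microsecond figures are illustrative placeholders only.
LATENCY_US = {"same_server": 0.1, "same_rack": 100, "other_rack": 300}

def best_replica(replicas):
    """replicas: list of (host, locality) pairs; pick the closest one."""
    return min(replicas, key=lambda r: LATENCY_US[r[1]])

print(best_replica([("hostA", "other_rack"), ("hostB", "same_rack")]))
# ('hostB', 'same_rack')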

I'm not familiar enough with traditional datacenters to make that comparison.