Wednesday, September 17, 2008

The Ten Concepts for Software Engineer (3)

5. Hashing
The idea behind hashing is fast access to data. If the data is stored sequentially, the time to find the item is proportional to the size of the list. For each element, a hash function calculates a number, which is used as an index into the table. Given a good hash function that uniformly spreads data along the table, the look-up time is constant. Perfecting hashing is difficult and to deal with that hash table implementations support collision resolution.

Beyond the basic storage of data, hashes are also important in distributed systems. The so-called uniform hash is used to evenly allocate data among computers in a cloud database. A flavor of this technique is part of Google's indexing service; each URL is hashed to particular computer.

Hash functions can be complex and sophisticated, but modern libraries have good defaults. The important thing is how hashes work and how to tune them for maximum performance benefit.


6. Caching
No modern web system runs without a cache, which is an in-memory store that holds a subset of information typically stored in the database. The need for cache comes from the fact that generating results based on the database is costly. For example, if you have a website that lists books that were popular last week, you'd want to compute this information once and place it into cache. User requests fetch data from the cache instead of hitting the database and regenerating the same information.

Caching comes with a cost. Only some subsets of information can be stored in memory. The most common data pruning strategy is to evict items that are least recently used (LRU). The prunning needs to be efficient, not to slow down the application.

A lot of modern web applications, including Facebook, rely on a distributed caching system called Memcached, developed by Brad Firzpatrick when working on LiveJournal. The idea was to create a caching system that utilises spare memory capacity on the network. Today, there are Memcached libraries for many popular languages, including Java and PHP.

7. Concurrency
Concurrency is one topic engineers notoriously get wrong, and understandably so, because the brain does juggle many things at a time and in schools linear thinking is emphasized. Yet concurrency is important in any modern system.

Concurrency is about parallelism, but inside the application. Most modern languages have an in-built concept of concurrency; in Java, it's implemented using Threads.

A classic concurrency example is the producer/consumer, where the producer generates data or tasks, and places it for worker threads to consume and execute. The complexity in concurrency programming stems from the fact Threads often needs to operate on the common data. Each Thread has its own sequence of execution, but accesses common data. One of the most sophisticated concurrency libraries has been developed by Doug Lea and is now part of core Java.


8. Cloud Computing
In our recent post Reaching For The Sky Through Compute Clouds we talked about how commodity cloud computing is changing the way we deliver large-scale web applications. Massively parallel, cheap cloud computing reduces both costs and time to market.

Cloud computing grew out of parallel computing, a concept that many problems can be solved faster by running the computations in parallel.

After parallel algorithms came grid computing, which ran parallel computations on idle desktops. Grid computing is widely adopted by financial companies, which run massive risk calculations. The concept of under-utilized resources, together with the rise of J2EE platform, gave rise to the precursor of cloud computing: application server virtualization. The idea was to run applications on demand and change what is available depending on the time of day and user activity.

To be continued..

No comments: