WEB LOAD BALANCING

This report analyzes "Load Balancing a Cluster of Web Servers Using Distributed Packet Rewriting", a research paper authored by Luis Aversa and Azer Bestavros, along with some other references, and examines the different schemes presented for load balancing. The report also presents the different issues that need to be considered to improve the state of web load balancing.

Introduction

With the overwhelming growth of the Internet, IP "appliances" are becoming essential parts of IP networks, and IP load balancers are one such appliance. Load balancing, in general, is the practice of spreading processing and communications activity evenly across a computer network so that no single device is overwhelmed. Load balancing is especially important for networks where it is difficult to predict the number of requests that will be issued to a server. Requests are directed to the most available server based on utilization; they can also be directed based on content type, such as video or text.

IP load balancers bring two major advantages to a multi-server environment.

As web sites and server farms or clusters field more and more clients, the constant upgrading of server hardware becomes not only tedious but also economically unsound. With the use of an IP load balancer, a cluster of identical servers can be built to seem like one super-powerful server, managed by the load balancer.

Load balancing can provide 100 percent fault tolerance, redundancy and failover. An example where load balancing is critical is e-commerce: if the data must go through, as with credit card transactions and online catalog sales, then server downtime is never tolerated, so some form of load balancing is required. An IP load balancer provides this resilience within a server cluster. The client receives smooth and continuous service, which is the ultimate goal of online services.

Load Balancing Methods

Load balancers come in two forms: Software based and Hardware based.

Software based load balancing is usually installed on a computer platform other than the server being load balanced. It is a low cost solution if deployed on existing software and hardware platforms, since there is often no need to place an additional hardware box into the network. However, high costs may be incurred if a new computer has to be purchased for the sole purpose of running the software at high speed. There may also be compatibility issues with existing hardware and software. Given these compatibility issues and multiple points of failure, customization of both software and hardware is often necessary, resulting in extra deployment costs. Throughput and capacity are highly dependent on the hardware platform, operating system, and other factors; the results are neither guaranteed nor standardized. This solution is also limited when using HTTP redirection. Generally the result is lower throughput and reductions in other performance metrics.

Hardware based load balancing comes in three flavors: PC based, Switch based, and Router based.

PC Based: A software solution put into a PC and sold as a package. Sometimes the vendor creates a bridge by installing two NICs in the PC. This offers some router-like functionality and saves the time and effort of integrating a software solution with the existing equipment. However, the result is often expensive; in fact, this can end up costing far more than anticipated. Also, there are many points of failure in a PC, far more than found in a switch or router.

Switch Based: A Fast Ethernet/Gigabit Ethernet switch has embedded multiprocessors, making it capable of performing local load balancing among typical servers, providing a Virtual Server IP interface to a set of Real Servers on the back-end. These allow fast traffic pass-through, but do not offer many of the advanced load balancing features found in the router-based solution which is discussed below. In particular, switch-based load balancers lack support for distributed networks and calculations of network proximity.

Router Based: With a router based load balancer, router functions can be used to establish static and dynamic pathways to facilitate traffic throughput and redirection. Also, there is seamless integration with the existing network, unlike a PC based solution. The router based solution works with any O/S or platform. Routers are robust and rarely fail.

 

Overview of Research Paper

The scheme discussed in the research paper uses Round Robin DNS (RR-DNS) to publish the individual addresses of all machines in the cluster of web servers, thereby distributing the responsibility of re-routing requests to each machine. All the hosts of the distributed system participate in connection routing using a method called Distributed Packet Rewriting (DPR). This distributed approach has better scalability and fault tolerance than the predominant use of centralized, special-purpose connection routers. DPR is an IP-level mechanism that equips a server with the ability to redirect an incoming connection to a different server in the cluster based on the very first packet received from the client, the SYN packet. Using this packet's information, the DPR-enabled server either forwards the connection to a different server or passes it up to its own application layer to be served.

There are two versions of DPR: Stateless and Stateful. Stateless DPR does not require any information beyond what can be found in the headers of each packet in a connection; it forwards a packet based on the value of a hash function. Stateful DPR forwards a packet based on information stored in a translation table. According to the experimental results, Stateful DPR achieves better throughput and a faster mean response time for the client.
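The stateless variant can be sketched in a few lines. The sketch below hashes the connection identifiers present in every packet header to pick a target server; the server addresses and the choice of MD5 are illustrative assumptions, not details from the paper.

```python
import hashlib

# Hypothetical server pool; the addresses are illustrative only.
SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

def stateless_target(src_ip: str, src_port: int) -> str:
    """Pick a server by hashing fields found in every packet header.
    Because the hash is deterministic, all packets of one connection
    reach the same server, so no per-connection state is needed."""
    key = f"{src_ip}:{src_port}".encode()
    digest = hashlib.md5(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(SERVERS)
    return SERVERS[index]
```

Because every packet of a connection carries the same source address and port, every packet hashes to the same server, which is exactly why no translation table is required.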

Illustration of How the DPR Scheme Works (See Fig 1)

Fig 1

For simplicity, assume three servers serving four clients. Let the servers be denoted Servers 1, 2 and 3, and the clients Clients A, B, C and D. Suppose Server 1 receives an original request from Client A, but is in no position to serve it because it is overloaded with requests from Clients B and C. Server 1 will therefore redirect the incoming request to either Server 2 or Server 3. Assume the request is redirected to Server 3, based on the load information about the other servers that Server 1 stores. Here Server 1 acts as a router, because it rewrites the packet received from Client A and sends it to Server 3. Server 3 may be servicing other requests too, so it has to differentiate between packets that are re-routed and packets that come directly from clients. To make this possible, Server 1 uses IP-IP encapsulation when forwarding packets to other servers: the forwarding server encapsulates the original packet received from Client A inside another IP packet, which is then routed to Server 3. Server 3 thereafter serves Client A directly, based on the client's source IP information contained in the encapsulated packet forwarded by Server 1. Subsequent requests still arrive at Server 1, from where they are redirected to Server 3. A routing table with information about redirected connections is maintained for serving these subsequent requests.
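The encapsulation step described above can be modeled compactly. This is a toy sketch that uses dictionaries in place of real IP headers (the server and client names are illustrative); it shows why the target server can still see the original client address and reply directly.

```python
def encapsulate(packet: dict, forwarder: str, target: str) -> dict:
    """IP-IP encapsulation, modeled as a dict: wrap the original
    client packet inside an outer 'IP header' addressed to the target,
    leaving the inner packet (and the client's source address) intact."""
    return {"src": forwarder, "dst": target, "payload": packet}

def handle_at_target(frame: dict) -> dict:
    """The receiving server strips the outer header and recovers the
    original packet, so it can serve and reply to the client directly."""
    return frame["payload"]

# Server 1 forwards Client A's SYN to Server 3.
syn = {"src": "clientA", "dst": "server1", "flags": "SYN"}
frame = encapsulate(syn, "server1", "server3")
inner = handle_at_target(frame)
```

The key property is that the inner packet survives unchanged: Server 3 sees `clientA` as the source and can respond without going back through Server 1.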

Each server in the cluster acts like a router. The routing decision is made after checking the load on the different servers in the cluster. The following metrics are used to estimate the load on the servers:

  1. The total number of open TCP connections each machine in the cluster has at any given time,
  2. The CPU utilization of each machine in the cluster,
  3. The number of redirected TCP connections of each machine in the cluster, and
  4. The number of active sockets of each machine in the cluster.

In addition, the above metrics can be combined to determine the load, using a different weight on each one of them.
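A weighted combination of the four metrics might be sketched as follows. The weights and the sample values are hypothetical (in practice they would be tuned per deployment, and the metrics would need consistent normalization); the sketch only shows the shape of the calculation.

```python
# Hypothetical weights over the four metrics listed above.
WEIGHTS = {"open_tcp": 0.4, "cpu": 0.3, "redirected": 0.1, "sockets": 0.2}

def load_score(metrics: dict) -> float:
    """Combine the per-server metrics into one weighted score;
    a lower score indicates a less loaded server."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)

def least_loaded(load_table: dict) -> str:
    """Pick the server with the smallest weighted load score."""
    return min(load_table, key=lambda srv: load_score(load_table[srv]))

table = {
    "server2": {"open_tcp": 10, "cpu": 0.5, "redirected": 2, "sockets": 12},
    "server3": {"open_tcp": 3,  "cpu": 0.2, "redirected": 1, "sockets": 4},
}
```

With these sample numbers, server3 scores lower and would receive the redirected connection.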

If no load packet is received from a machine for a certain number of seconds, then that machine's entry is deleted from the load table, to avoid redirecting requests to a machine that is not running. Also, using IP-IP encapsulation to redirect connections allows servers to reside on different networks.
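The stale-entry eviction can be sketched directly. The timeout value and table layout here are assumptions for illustration; the paper only says entries are deleted after "a certain number of seconds".

```python
STALE_AFTER = 5.0  # seconds without a load packet; illustrative value

def prune_load_table(load_table: dict, last_seen: dict, now: float) -> None:
    """Delete servers that have not broadcast a load packet recently,
    so no new connection is redirected to a dead machine."""
    for server in list(load_table):
        if now - last_seen.get(server, 0.0) > STALE_AFTER:
            del load_table[server]
            last_seen.pop(server, None)

table = {"server2": 6.7, "server3": 2.1}
seen = {"server2": 100.0, "server3": 92.0}
prune_load_table(table, seen, now=100.5)
```

After pruning, server3 (last heard from 8.5 s ago) is gone, while server2 (0.5 s ago) remains eligible for redirections.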

Advantages of Using DPR

Analysis of Research Paper

The different options to load balance are growing daily and it is important to distinguish characteristics that make a good load balancing scheme.

Any good load balancer should address the issues discussed below.

Characteristics Discussed in this Research Project

There are two versions of DPR discussed: Stateless and Stateful. The major problem of Stateful DPR is that it will keep forwarding packets to a server even though that server may have failed mid-connection, so there is no dynamic readjustment.

Research paper says: In order to overcome the lack of dynamic readjustment in Stateful redirection, we can have a wait policy. If the server does not respond within a specified time period, the request can be redirected to another server whose workload is low. The translation table should then be updated so that future requests are sent only to the new server. Also, as far as speed goes, Stateful redirection is a better policy than Stateless redirection. This is one good way of handling web load balancing.
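The wait policy can be sketched as a timeout-plus-fallback wrapper. The function names, the `send` callback, and the 2-second timeout are hypothetical stand-ins introduced for illustration.

```python
def redirect_with_wait(request, primary, fallback, translation_table,
                       send, timeout=2.0):
    """Forward to the primary server; if no response arrives within
    `timeout` seconds (here, `send` returning None models that), re-send
    to a lightly loaded fallback and update the translation table so
    later packets of this connection follow the fallback."""
    response = send(primary, request, timeout)
    if response is None:                      # primary presumed failed
        translation_table[request["conn"]] = fallback
        response = send(fallback, request, timeout)
    return response

# A fake transport for demonstration: server3 never answers.
def fake_send(server, req, timeout):
    return None if server == "server3" else {"from": server}

table = {}
resp = redirect_with_wait({"conn": 1}, "server3", "server2", table, fake_send)
```

After the call, the translation table maps connection 1 to the fallback server, so the redirection is permanent rather than retried on every packet.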

A solution must, at all times, be aware of the status and health of all the servers within the clusters it is responsible for. The solution must periodically monitor the servers’ physical and application layer health and must be able to remove a server from service if there are any problems.

Research paper says: A scheme to monitor server health is implemented, using three user processes and seven new system calls. One process is in charge of periodically broadcasting the local server's own load. A second process waits for the loads of the other servers participating in the DPR protocol to be multicast. The third process is in charge of cleaning up the load and routing tables. This technique is one way of taking care of web load balancing.
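The three processes can be sketched with threads and an in-memory queue standing in for the multicast channel; the real system uses separate processes, multicast sockets, and kernel system calls, so everything below is a simplified model.

```python
import queue
import threading
import time

bus = queue.Queue()            # stand-in for the multicast channel
load_table, last_seen = {}, {}

def broadcaster(my_id, load, rounds=3):
    """Process 1: periodically announce this server's own load."""
    for _ in range(rounds):
        bus.put((my_id, load, time.time()))
        time.sleep(0.01)

def listener(expected):
    """Process 2: record load packets multicast by peer servers."""
    for _ in range(expected):
        server, load, ts = bus.get()
        load_table[server] = load
        last_seen[server] = ts

def cleaner(stale_after=1.0):
    """Process 3: drop peers whose load reports have gone stale."""
    now = time.time()
    for server in [s for s, t in last_seen.items() if now - t > stale_after]:
        load_table.pop(server, None)
        last_seen.pop(server, None)

t = threading.Thread(target=broadcaster, args=("server2", 0.4))
t.start()
listener(expected=3)
t.join()
cleaner()
```

The division of labor mirrors the paper's design: one announcer, one collector, one janitor for the tables.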

One of the main reasons load balancing solutions are deployed is to provide full fault tolerance and high availability for a server farm. One of their most important tasks is to provide a full-availability solution in case of server failure.

Research paper says: The paper uses RR-DNS to publish the individual addresses of all machines in the cluster of web servers, distributing the responsibility of re-routing requests to each machine. This technique may not work if the server to which a request has been redirected fails. Hence, there should be a manager that decides whether a server receiving redirected requests has failed. A router based solution does not have this problem.

Connection routers present a single IP address while performing packet rewriting, load balancing and network gateway functions.

Research paper says: Though the functions of DPR do not completely replace those of a centralized connection router, a simple solution such as RR-DNS is sufficient to provide the illusion of a single IP address, and standard routers are sufficient to provide gateway functions. This technique is good.

The client should not know that his request has been redirected to a different server because of a possible failure of the server servicing the request.

Research paper says: The scheme discussed in this paper does not achieve full redirection. The paper says "if no load packet is received from one machine for a certain number of seconds, then the entry of this machine in the load table is deleted to avoid redirecting connection to a machine that is not running". The research paper does not discuss further requests that come from the same client after a connection is established. Will the client's ongoing request be met without any interruption? I would suggest a traffic manager that directs further requests from that client to a different machine, so that the availability of the web site approaches 100 percent.

Multiple servers may be purchased at different points in time, so we must verify that each server is capable of performing all the functions assigned to it without degradation in performance.

Research paper says: The research paper has not taken this issue into consideration. DPR functions may require additional server capacity. With the scheme presented in the paper, every server should be capable of performing the routing function in addition to serving applications. It is quite possible that we may need to keep all the computers equal, or at least powerful enough to perform the functions assigned to them.

Additional Characteristics to Be Considered for Load Balancing on the Web

Policy-based server assignment gives the flexibility to redirect client requests to the appropriate server based on service policy and destination. With this feature, users are directed to special content servers without much delay. Popular content can be replicated on many servers, for example, while rarely accessed contents are stored only in a few servers.

A big web site is implemented with multiple servers purchased at different points in time, so some servers are more powerful than others. A good scheme should intelligently direct traffic to each of these servers according to its capacity. This type of redirection can achieve a significant difference in total throughput.
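Capacity-aware direction can be sketched as a weighted random choice. The server names and the relative capacities below are hypothetical; the point is only that a server with four times the capacity should receive roughly four times the traffic.

```python
import random

# Hypothetical relative capacities; newer machines get larger weights.
CAPACITY = {"old-server": 1, "mid-server": 2, "new-server": 4}

def pick_by_capacity(rng=random):
    """Weighted random choice: traffic is spread in proportion to
    each server's capacity rather than uniformly."""
    servers = list(CAPACITY)
    weights = [CAPACITY[s] for s in servers]
    return rng.choices(servers, weights=weights)[0]
```

Over many requests, the draw frequencies converge to the 1:2:4 capacity ratio, so the oldest machine is never asked to carry the same load as the newest.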

Some of the users/requests may be more important than other users/requests (paying customers for example). Priority server pools should be kept lightly loaded to service these requests.

There are many possible solutions for recognizing the importance of content. For instance, important business transactions can be identified as SSL (Secure Sockets Layer) connections. Once a request is identified as important, it can be directed to a priority IP address for faster execution.
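That SSL-based policy amounts to a simple port check at the balancer. The pool names below are hypothetical; the sketch assumes SSL traffic is recognized by its standard TCP port, 443.

```python
# Hypothetical pools: priority servers kept lightly loaded, and the rest.
PRIORITY_POOL = ["prio-1", "prio-2"]
GENERAL_POOL = ["web-1", "web-2", "web-3"]

def assign_pool(dst_port: int) -> list:
    """Route SSL connections (TCP port 443) to the priority pool;
    all other traffic goes to the general pool."""
    return PRIORITY_POOL if dst_port == 443 else GENERAL_POOL
```

A credit card transaction arriving on port 443 lands in the lightly loaded priority pool, while ordinary HTTP browsing on port 80 is spread across the general servers.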

A heavy-tailed distribution of requests can create problems by increasing the waiting time before a request can be served: the time taken to serve content may be disproportionate to the size of the content being viewed. To overcome the heavy-tailed distribution problem, there can be a policy where small tasks, which may be the overwhelming majority, are run on hosts loaded below the average system load, and large tasks, a tiny minority, are run on hosts loaded above the average system load. This type of strategy is discussed in a paper titled "Task Assignment in a Distributed System: Improving Performance by Unbalancing Load" written by Mark E. Crovella and Mor Harchol-Balter.
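The size-based policy can be sketched as follows. The size cutoff, host names, and load numbers are illustrative assumptions; the cited paper derives the actual assignment boundaries analytically.

```python
SMALL_CUTOFF = 64 * 1024   # hypothetical boundary between small and large (bytes)

def assign_by_size(task_size: int, hosts: dict, avg_load: float) -> str:
    """Send small tasks to hosts loaded below the system average and
    large tasks to hosts loaded above it, so the many small tasks are
    never stuck queueing behind a rare huge one."""
    light = {h: l for h, l in hosts.items() if l < avg_load} or hosts
    heavy = {h: l for h, l in hosts.items() if l >= avg_load} or hosts
    pool = light if task_size < SMALL_CUTOFF else heavy
    return min(pool, key=pool.get)   # least loaded host within the pool

hosts = {"h1": 0.2, "h2": 0.8}
```

Deliberately keeping some hosts more loaded than others is what gives the strategy its "unbalancing" name: the imbalance isolates the heavy tail.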

Provisions for regular server maintenance without service interruption are vital. A solution for shut down must allow a server to be gradually taken out of service from the cluster in case the need arises (e.g. periodic maintenance, software upgrade, etc..). All users being serviced by a server about to be taken out of service must not experience any service interruption. In other words, a solution must allow current users to continue to be serviced by the same server while no new users are sent to that server.

A server must be allowed to gradually come into service after undergoing maintenance. This includes the ability to present user load to the server incrementally until the server reaches its full potential. It also includes the ability to allow for server software to stabilize after initiation. User should be given access to the activated server only after it is affirmed that the server software/hardware is stable and ready to provide service.

A load balancing solution can be considered more flexible if it can provide a means for configuring servers in a cluster that are meant only for backup purpose. In other words, a server or multiple servers will be configured for activation only if regular active servers were to fail. Single machines can then act as backup for multiple clusters. Backup servers will only be used if and when necessary, providing additional fault tolerance for the site. The policy for backup servers taking up the load should be easy to implement with the schema presented in the research.
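The backup-pool policy reduces to a small selection rule. This sketch assumes a health-check callback and hypothetical server names; the backup list is consulted only when the regular pool is empty of healthy servers.

```python
def active_servers(primary: list, backup: list, is_healthy) -> list:
    """Serve from the regular pool; activate the backup pool only
    when every regular server has failed its health check."""
    healthy = [s for s in primary if is_healthy(s)]
    return healthy or [s for s in backup if is_healthy(s)]

# Example health states: both regular servers down, spare available.
status = {"web-1": False, "web-2": False, "spare-1": True}
pool = active_servers(["web-1", "web-2"], ["spare-1"], status.get)
```

Because the backup list is shared state rather than dedicated hardware, a single spare machine can back up several clusters at once, as the paragraph above suggests.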

If the servers are not in one geographic location, then we need some form of traffic distribution that directs each client to the nearest server. We can have multiple locations implement a site, with the locations reasonably distributed to match the site's user demographics, and then use some sort of traffic distributor to access the closest location, minimizing the chance that congestion will be encountered.

One important consideration in favor of replicating the same content across geographically distributed servers is that data storage is much cheaper than the cost of hauling the traffic over the internet. The savings in communications costs are more than enough to pay for the replicated servers.

Sometimes site access delays have nothing to do with server performance. They may be caused by delays within parts of the internet itself. Replicated server clusters can deliver content while avoiding congested areas of the internet.

As the usage of internet resources grow, available bandwidth for organizations and service providers on the internet becomes increasingly valuable. In some countries, the cost of increasing bandwidth to the internet is extremely high. Therefore, solutions for caching Web information, such as Cache (Proxy) Servers, are becoming more and more popular. Typically, such solutions cache web data when it is first accessed by an end user and reuses it when another user requests the same information. This saves time for the end user and bandwidth for the organization.

Caching solutions are based on the assumption that large groups of end users share common interests and therefore the same information will often be accessed more than once.

This is with respect to how easily we can maintain the servers and other equipment. The paper does not take ease of maintenance into account. Whenever a change is made to the functionality of the routing programs, all the servers must be brought down and updated, because each of them is responsible for routing the requests received from clients. This may pose a problem, especially when the servers are being accessed from all over the world.

Conclusion

This report has discussed load balancing, the different methods used to load balance web applications, and other important characteristics that are not covered in the research paper.

A good load balancing solution, in my opinion, is one that meets all the characteristics listed in this report.

 

References

  1. Luis Aversa and Azer Bestavros, "Load Balancing a Cluster of Web Servers Using Distributed Packet Rewriting"
  2. Mark E. Crovella, Mor Harchol-Balter and Cristina Murta, "Task Assignment in a Distributed System: Improving Performance by Unbalancing Load"
  3. http://www-4.ibm.com/software/network/dispatcher/
  4. http://www.cisco.com/warp/public/cc/cisco/mkt/scale/locald/index.shtml
  5. http://www.f5labs.com/bigip/index.html