Implementation of a One-way L7 Web Switch with State Migration, Switch-back Masquerading, and Cookie Name Rewriting

Ying-Dar Lin, Po-Ching Lin, Ping-Tsai Tsai
{ydlin, pclin, motse}@cis.nctu.edu.tw
Department of Computer and Information Science
National Chiao Tung University, HsinChu, Taiwan
Tel: +886-3-5731899  FAX: +886-3-5721490  Email: pclin@cis.nctu.edu.tw

Abstract. A Web switch is normally deployed in front of a cluster of Web servers to dispatch requests for providing scalable Web services. L7 Web switches, which can perform content-aware routing, increase the flexibility of dispatching requests. One-way architectures, in which only the requests pass through the Web switch, enhance scalability because the Web switch bears a lighter workload. Nevertheless, implementing a one-way architecture on an L7 Web switch is difficult: the L7 Web switch has to establish a connection with the client before the request is received, yet the backend server must also obtain the connection information to be able to send its responses directly to the client. This work reviews contemporary solutions to this problem and presents a novel architecture, Direct Web Switch Routing (DWSR), for building a one-way L7 Web switch. One-packet state migration with pre-allocation is used to implement the one-way architecture, and switch-back masquerading and cookie name rewriting maintain connection and session persistence, respectively. These methods are implemented in the Linux kernel as a patch of the open-source Linux Virtual Server IPVS 0.9.8. The results indicate that the presented architecture can handle as many as 14,285 requests per second.

Keywords: Web switch, one-way architecture, connection persistence, session persistence

I. Introduction

Web sites potentially face a large number of user requests. A single Web server, however, is limited by its capacity and subject to the pace of technological evolution. It may also become a single point of failure, which is particularly serious for an e-commerce site. Web clustering is a feasible and highly scalable architecture in which a Web switch is deployed in front of a cluster of Web servers. The Web switch, bearing a virtual IP address as if it were the unique contact point, may operate at either layer 4 or layer 7, depending on whether it can make the dispatching decision according to the request content. It is therefore commonly referred to as an L4 or L7 Web switch. Both architectures can be further classified as one-way and two-way architectures. In the one-way architecture (the term varies across works and products; direct server return, switch-back and direct routing are all synonyms), only the requests pass through the Web switch, while the responses go directly to the clients. In the two-way architecture, traffic in both directions must pass through the Web switch. The latter is considered less scalable because the Web switch must process the responses, which typically dominate the Web traffic.

An L7 Web switch allows more flexible dispatching because it can search the request content for URLs, cookies or SSL identifiers, and dispatch the request according to them. Content-aware dispatching offers several advantages. Web content can be optimally partitioned and stored on various servers - an advantage that cannot be feasibly provided with an L4 Web switch. The cache locality on each server can also be improved.
Presently, many Web sites use server-side scripts to support dynamic content, such as that associated with database queries and shopping carts. An application-layer session is established across several TCP connections on the same Web server for a particular period. An L4 Web switch may dispatch these connections to different servers, resulting in errors, because the information in such a session is supposed to be handled by the same server. An L7 Web switch can maintain session persistence by identifying cookies or SSL identifiers.

Despite the above advantages, implementing a one-way L7 Web switch is non-trivial. Three major difficulties arise:

(1) State migration: The request content is unavailable until a connection has been established between the client and the Web switch, and the Web switch then establishes another connection with the server to pass the request, forcing the use of a two-way architecture. The server must somehow obtain the established TCP state from the Web switch to communicate directly with the client.

(2) Connection persistence: HTTP 1.1 supports persistent connections in which multiple requests can be carried in a single TCP connection [1], but these requests may be dispatched to various servers because the content is partitioned. A means of dispatching these requests to the correct servers while keeping the efficiency of persistent connections must be considered.

(3) Session persistence: Session persistence is normally maintained by identifying a unique cookie inserted by the server. In a one-way architecture the responses bypass the Web switch, so the Web switch has no way of learning the cookie and cannot dispatch the subsequent requests according to it.

This work focuses on these three difficulties and reviews the contemporary solutions in both the research field and the area of commercial products. Readers are referred to Cardellini et al. for a general review of locally distributed architectures [2]. This work also introduces a one-way L7 Web switch architecture, namely Direct Web Switch Routing (DWSR), that can migrate the TCP state to the backend servers and support connection and session persistence elegantly using three mechanisms - one-packet state migration with pre-allocation, switch-back masquerading and cookie name rewriting. These methods are implemented in Linux kernel modules, and the implementation is benchmarked for its internal and external performance. The design philosophy of this architecture is compared with that of the contemporary solutions.

The rest of this paper is organized as follows. Sections II, III and IV review the existing solutions to TCP state migration, connection persistence and session persistence, respectively, and propose new solutions; the differences between our architecture and existing ones are also outlined there. Section V presents the implementation and benchmark of DWSR. Section VI draws the conclusions.

II. State migration in Web switch design

A. Existing solutions to designing a one-way L7 Web switch

In an HTTP connection, a client will not send its request until it has established the connection with the contact point of a Web site - normally a Web switch in a locally distributed architecture. In a one-way architecture, the backend server must know the TCP state from the Web switch to communicate with the client directly. The migration inevitably requires either modifying the kernel or adding special kernel modules on the backend servers. Deploying such specialized servers is generally acceptable.
Current solutions are variants of the TCP handoff protocol [3-7] and the proprietary Resonate Exchange Protocol (RXP) used in a commercial product [8]. These solutions differ in the format of the state information to be migrated and in where the request dispatcher is implemented. The client information (including the request) and the TCP state information can be carried in a special message, or the request can be encapsulated intact with a special header. In [3-5, 8], the request dispatcher is located on the front-end Web switch. In contrast, the dispatching function can also be implemented on the backend servers without a front-end Web switch. In the latter case, the client can select a server using round-robin DNS [6], or each server can filter the SYN packet multicast from a front-end switch or router [7] to decide whether to service the request. The request dispatcher on these servers transfers the request and the connection to another server with TCP handoff if the requested content is not local. The backend servers thus play the role of a front-end Web switch, as well as providing the Web content.

B. One-packet TCP state migration with pre-allocation

In the above solutions, the front-end Web switch has to perform a three-way handshake with the client before receiving its request. This is called delayed binding and involves extra work by the Web switch. This work presents a mechanism, one-packet TCP state migration with pre-allocation, in which the Web switch initially acts like an L4 Web switch. The Web switch pre-allocates a server according to its own dispatching algorithm after it has received the SYN packet from the client. It passes this packet to the chosen server and allows the server to complete the three-way handshake with the client. Lacking the request content, the Web switch may dispatch to an incorrect server.

Figure 1 presents the process when the pre-allocation is correct; the process with incorrect pre-allocation is presented later. A filtering module is installed on each server to intercept the packets between the Web switch and the TCP/IP protocol stack on the server, and to modify both the incoming and outgoing packets if necessary. In Step 1, the Web switch receives the SYN packet and passes it to Server 1 according to its dispatching algorithm, for example in a round-robin manner. All servers share the same virtual IP address (VIP) as the Web switch besides their own IP addresses, so the Web switch must alter the destination MAC address to the server's MAC address. The filtering module transparently passes the packets during the three-way handshake, and then the Web switch receives the request in Step 7. If Server 1 is the correct server according to the policy rules, the Web switch simply passes the request to it.

Figure 1: Steps of one-packet TCP state migration with correct pre-allocation

Figure 2 presents the process when the pre-allocation is incorrect. If the request in Step 7 should be dispatched to another server, say Server 2, according to the policy rules, then the Web switch first sends an RST packet to notify the filtering module on the pre-allocated server. This filtering module holds the RST packet for a short period before the connection is terminated. By maintaining the connection for a period, the connection can be reused without another three-way handshake if a later request is assigned to this server during this period. After the RST packet has been sent, the Web switch sends the request to Server 2.
Note that the sequence number WSEQ+1 of the earlier connection with Server 1 is also transferred in the request, from which Server 2 can continue the connection. Because no previous connection has been made to Server 2, the filtering module on it establishes a connection with the local TCP/IP protocol stack from Step 10 to Step 12, and then passes the request. The response is sent directly to the client in Step 14. The sequence number RSEQ+1 differs from WSEQ+1, so the filtering module on Server 2 must change the sequence number.

Figure 2: Steps of one-packet TCP state migration with incorrect pre-allocation

Chow [9] presents a similar architecture, which also has a filtering module on the server and a pre-allocation scheme. However, the architecture presented herein supports connection persistence. Each request in a TCP connection must pass through the Web switch, so the Web switch can dispatch the request to the correct server. Delegating the whole TCP connection to the client and the server requires that the filtering module be responsible for the request redirections. The architecture also allows an earlier connection to be maintained for a period and reused as required (see Section III for further discussion). It also supports session persistence by cookie name rewriting - a feature not exhibited by Chow's architecture.

The presented architecture is simple and highly scalable, because the Web switch examines only the requests. It has two limitations, however. First, it does not support HTTP pipelining [1], which allows multiple requests to be sent before the response to a previous request returns; the acknowledgment number in a pipelined request may correspond to the response of a different request. Fortunately, pipelining is rarely used by popular browsers, such as Internet Explorer (IE) and Mozilla. According to our experiment, IE opens two connections to the Web server and sends requests in them rather than pipelining in a single connection. Mozilla turns off pipelining by default and warns users who enable the option. This possibility is therefore ignored when connection persistence is discussed herein. Second, TCP options such as SACK or ECN are negotiated during connection establishment, so all of the servers in the cluster must have the same options.
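To summarize Section II-B, the dispatching logic on the Web switch side can be sketched in a few lines of C. The sketch below is illustrative only: the structure and function names (conn_entry, dispatch_syn, rule_lookup, dispatch_request) are invented for clarity and do not correspond to identifiers in the IPVS patch, and the real code operates on sk_buffs inside the kernel rather than on plain structures.

```c
/*
 * Illustrative user-space sketch of the Web switch's dispatching logic
 * in Section II-B.  All names are invented for clarity; the actual DWSR
 * module works on sk_buffs inside the Linux kernel.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NSERVERS 6

struct conn_entry {              /* one entry per client connection      */
    uint32_t client_ip;
    uint16_t client_port;
    int      server;             /* server currently bound to the client */
};

static int rr_next;              /* round-robin pointer for pre-allocation */

/* Step 1: a SYN arrives; pre-allocate a server without any content.
 * The caller then rewrites the destination MAC address and forwards. */
static int dispatch_syn(struct conn_entry *c)
{
    c->server = rr_next;
    rr_next = (rr_next + 1) % NSERVERS;
    return c->server;
}

/* Tiny stand-in for the policy rules: .gif goes to server 0, the rest to 1. */
static int rule_lookup(const char *url)
{
    const char *ext = strrchr(url, '.');
    return (ext && strcmp(ext, ".gif") == 0) ? 0 : 1;
}

/* Step 7: the request arrives; re-dispatch if the pre-allocation was wrong. */
static int dispatch_request(struct conn_entry *c, const char *url)
{
    int target = rule_lookup(url);
    if (target != c->server) {
        /* Notify the old server's filtering module; it holds the RST briefly
         * so that the connection can be reused if a later request returns. */
        printf("send RST to server %d\n", c->server);
        c->server = target;      /* forward the request (with WSEQ+1) here */
    }
    return c->server;
}

int main(void)
{
    struct conn_entry c = {0};
    dispatch_syn(&c);                     /* pre-allocation on the SYN     */
    dispatch_request(&c, "/index.html");  /* may trigger RST + re-dispatch */
    return 0;
}
```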
III. Maintaining HTTP connection persistence

A. Existing solutions to maintaining connection persistence

The way in which an L7 Web switch supports connection persistence is non-trivial. A subsequent request in a connection may be assigned to a different server, but that server cannot receive the request directly without first establishing a connection. Both one-way and two-way architectures must tackle this problem. This work investigates commercial solutions by watching the packets with Sniffer and reviews the solutions in the academic field.

(1) Rewriting the HTTP header

Some commercial products, such as Foundry ServerIron and Radware WSD, degrade a persistent connection to a non-persistent one to circumvent the problem. The Web switch searches the request for "Connection: Keep-Alive", and then rewrites it as "Connection: close". The server actively closes the connection when it receives this instruction. Despite its simplicity, this solution cannot exploit connection persistence.

(2) Using HTTP status code 302

The Web switch can dispatch a request to another server if required; the Cisco CSS series operates in this manner. When the requested content is on another server, the Web switch sends an RST packet to the current server to close the present connection. The server then responds to the client with status code 302 and the new URL for the client to follow. Status code 302 specifies that the requested content is located temporarily at a different URL [1]. The client initiates a new request with the new URL, and the Web switch then redirects the new request as usual. This solution is better because the connection remains persistent unless a redirection is required.

(3) Multiple TCP connection handoff and backend request forwarding

Aron et al. [4] proposed multiple TCP connection handoff to maintain connection persistence, but transferring the TCP state between backend servers in this way is complex. They also proposed a simpler solution, named backend request forwarding, instead of handing off the connection to another server. If the requested content is unavailable on a server, the Web switch can instruct this server to fetch the content from another server and then return the content through its own connection to the client. This method increases latency and demands more bandwidth, and is therefore appropriate only for short responses.

B. Switch-back masquerading

The requested content in a persistent connection may be on another server - a situation similar to that of selecting an incorrect server in the pre-allocation mechanism. Hence, the presented architecture can easily support persistent connections. By sending an RST to halt the wrong connection and redirecting the request to the correct server, as in the pre-allocation scheme, the Web switch can always maintain a correct connection.

Figure 3 illustrates an example. Suppose a request has been sent to Server 1. When a new request whose content is on another server (Server 2 in this example) is received in Step 1, the Web switch sends an RST to Server 1 in Step 2. Upon receiving the RST, the filtering module on Server 1 maintains the connection state for a period of, say, 15 seconds, because a later request may soon be dispatched to it again. If no more requests are sent to it during this period, the filtering module simply sends the RST into the protocol stack to close the connection; otherwise, the current connection can be reused. When receiving the new request in Step 3, the filtering module on Server 2 determines whether the connection state has been established. If not, it rebuilds the connection from Steps 4 to 6 and sends the request to the protocol stack in Step 7. Assume that the Web switch receives a request for Server 1 again in Step 11, within the 15 seconds. The switch then sends an RST to Server 2 to stop that connection in Step 12, and sends the request to Server 1 again in Step 13. The connection on Server 1 can simply be reused in Step 14 and the subsequent steps. Note that the sequence number from Server 2 is carried in the acknowledgment number of the request. The filtering module on Server 1 knows how to continue from the last sequence number, and alters the following sequence numbers and acknowledgment numbers accordingly. Requests can thus be switched back to an existing connection, which is why the mechanism is named switch-back masquerading.

Figure 3: Steps of the switch-back masquerading mechanism
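The essence of the masquerading performed by the filtering module is a per-connection offset between the sequence space the client negotiated and the sequence space of the local protocol stack. The user-space C sketch below illustrates only this arithmetic; the names are invented, and the in-kernel module must additionally update the TCP checksum after each rewrite.

```c
/*
 * Illustrative sketch of the sequence-number translation in switch-back
 * masquerading (Sections II-B and III-B).  The client only ever sees the
 * sequence space of its original handshake; each server's local stack has
 * its own.  The filtering module keeps the offset between the two and
 * rewrites sequence/acknowledgment numbers in both directions.
 * Names are invented; the in-kernel code also fixes up the TCP checksum.
 */
#include <stdint.h>
#include <stdio.h>

struct masq_state {
    uint32_t offset;   /* local_seq - client_visible_seq (mod 2^32) */
};

/* Called when the local stack (re)builds the connection: remember how far the
 * locally chosen sequence number (RSEQ) is from the one the client saw (WSEQ). */
static void masq_init(struct masq_state *m, uint32_t local_seq, uint32_t client_seq)
{
    m->offset = local_seq - client_seq;        /* unsigned wrap-around is fine */
}

/* Outgoing response: shift the server's sequence number into the client's view. */
static uint32_t masq_out_seq(const struct masq_state *m, uint32_t seq)
{
    return seq - m->offset;
}

/* Incoming request: shift the client's acknowledgment number into the local view. */
static uint32_t masq_in_ack(const struct masq_state *m, uint32_t ack)
{
    return ack + m->offset;
}

int main(void)
{
    struct masq_state m;
    masq_init(&m, 0x9abcdef0u /* seq on Server 2 */, 0x12345678u /* seq seen by client */);
    printf("client sees seq %#x\n", (unsigned)masq_out_seq(&m, 0x9abcdef1u)); /* 0x12345679 */
    printf("stack sees ack  %#x\n", (unsigned)masq_in_ack(&m, 0x12345679u));  /* 0x9abcdef1 */
    return 0;
}
```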
Table 1 summarizes the above mechanisms for maintaining connection persistence. Unlike the existing solutions, switch-back masquerading keeps the benefits of connection persistence while remaining highly scalable. After one-packet state migration with pre-allocation has been implemented, the additional implementation cost of switch-back masquerading is quite low.

Approach | Mechanism | Architecture | Keeps the benefits of connection persistence | Implementation cost | Scalability
Rewriting HTTP header | rewrite the HTTP header to close the existing connection | two-way | no | low | low
Using HTTP status code 302 | respond with status code 302 | two-way | yes, until the server changes | low | medium
Multiple TCP connection handoff | hand the TCP state over among servers | one-way | yes | high | high
Backend request forwarding | fetch the requested content from another server if it is not local | one-way | yes | low | medium
Switch-back masquerading at the filtering module | reuse an existing connection when switching back | one-way | yes | low | high

Table 1: Comparison of mechanisms for tackling the connection persistence problem

IV. Supporting session persistence

A. Existing solutions to maintaining session persistence

An application-layer session retains user information, program variables and logs on the same Web server across several TCP connections during a Web transaction. Cookies are normally used to maintain session persistence. When initiating a session, a server inserts a unique session cookie in the response. The requests from the same client in the following connections carry this session cookie, and the Web server uses it to identify the client across connections. A Web switch is content-aware and can recognize the cookie to maintain the session. Some solutions already exist.

(1) Pre-definition by programmers

The programmer adds code to insert a unique cookie. For instance, server A inserts a cookie "SERVER=A", server B inserts "SERVER=B", and so on. The Web switch configuration should be consistent with the cookies designed by the Web programmers.

(2) Automatic cookie insertion

The Web switch inserts a cookie that contains switching information, such as the server's IP address. It can search for this cookie in subsequent requests to ensure correct redirection. This approach is transparent, but its overhead is high. The Web switch must modify the length of the HTTP header and insert a string into the packet, and the packet may need to be fragmented if its size exceeds the MTU after the insertion.

(3) Cookie learning

When the response packets pass through the Web switch, the Web switch records the session cookie and the server from which the packets come. The Web switch then parses the HTTP header in the subsequent requests and looks up the mapping between the session cookie and the server. This approach is also transparent and its overhead is low.

B. Cookie name rewriting at the filtering module

Automatic cookie insertion and cookie learning offer easy configuration, but they are not designed for one-way architectures because the responses must pass through the Web switch. This work presents a cookie name rewriting mechanism that keeps this advantage in one-way architectures. When the response packets pass through the filtering module, the module searches for the session cookie and rewrites the first eight characters of its name to carry special switching information. For instance, "PHPSESSID=ABC" is rewritten as "DR0189C1D=ABC". The first two characters are a keyword indicating that the cookie has been modified to carry switching information, the following three characters identify the content type, and the final three characters represent the server identifier.
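Because the rewriting is a fixed-length, in-place substitution, it can be illustrated with a few lines of C. The sketch below uses the example above; the function name and the flat character buffer are simplifications, since the real filtering module edits the packet payload inside the kernel.

```c
/*
 * Illustrative sketch of cookie name rewriting (Section IV-B).  The first
 * eight characters of the session cookie's name are overwritten with
 * "DR" + a three-character content identifier + a three-character server
 * identifier, so the header length never changes.  The buffer handling is
 * simplified; the real module rewrites the packet payload in the kernel.
 */
#include <stdio.h>
#include <string.h>

/* Rewrite the cookie name starting at 'cookie' in place.
 * Returns 0 on success, -1 if the name is not longer than eight characters. */
static int rewrite_cookie_name(char *cookie, const char *content_id, const char *server_id)
{
    char *eq = strchr(cookie, '=');
    if (eq == NULL || eq - cookie <= 8)
        return -1;                        /* name too short to carry the tag */
    memcpy(cookie, "DR", 2);              /* keyword: carries switching info  */
    memcpy(cookie + 2, content_id, 3);    /* which content partition          */
    memcpy(cookie + 5, server_id, 3);     /* which backend owns the session   */
    return 0;
}

int main(void)
{
    char set_cookie[] = "PHPSESSID=ABC";          /* cookie in a server response */
    rewrite_cookie_name(set_cookie, "018", "9C1");
    printf("%s\n", set_cookie);                   /* prints "DR0189C1D=ABC"      */
    return 0;
}
```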
Fragmentation is avoided because the header length is unchanged - an advantage not available with cookie insertion. The following requests in the same session carry this special cookie, so the Web switch can dispatch them according to the modified cookie name. The only constraint is that the cookie name should be longer than eight characters, but this constraint can easily be circumvented in practice.

This solution can coexist with URL switching. The Web switch first examines the URL, and then checks whether there is a cookie bearing the content identifier corresponding to the type of content implied by the URL. If so, it extracts the server identifier from the cookie name and dispatches the request to that server. Otherwise, normal URL switching is performed.

V. Implementation and benchmark

A. Functional description and implementation of the DWSR

The proposed implementation is integrated as a patch of Linux Virtual Server IPVS 0.9.8 [10] - a popular open-source L4 load balancer project. IPVS is a Netfilter [11] module; Netfilter defines five hooks in the path of packet traversal: PRE-ROUTING, LOCAL-IN, FORWARD, LOCAL-OUT and POST-ROUTING. The system comprises two parts - the DWSR Web switch module and the filtering module on the server. The code is open source and available at [12].

Figure 4 depicts the functional blocks. On the left side is the Web switch, in which the DWSR module is hooked at LOCAL-IN to intercept the request. The dispatcher looks up the connection table and the rule table to make the dispatching decision. Each server has the same virtual IP address (VIP) as that of the Web switch, besides its own real IP address (RIP). The servers are configured not to respond to ARP requests for the VIP, so that only the Web switch receives the request. The Change MAC module changes the destination MAC address to the target server's MAC address to pass the request. The filtering module, hooked at LOCAL-IN and LOCAL-OUT, intercepts requests to the server. If the server has the TCP state of the connection, the request is passed to the TCP masquerading module, which alters the sequence number; otherwise, the module replays a three-way handshake with the TCP/IP protocol stack to rebuild the TCP state. The response also passes through the TCP masquerading module and then goes directly to the client.

Figure 4: Functional blocks of the DWSR system
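For readers unfamiliar with Netfilter, the skeleton below shows how a server-side filtering module of this kind attaches to the LOCAL-IN and LOCAL-OUT hooks. It is only an illustration: the original patch targets a 2.4-era kernel and IPVS 0.9.8, whereas this sketch uses the current hook registration API, and the hook bodies simply accept every packet, marking where the TCP state rebuilding, sequence-number masquerading and cookie rewriting would take place.

```c
/*
 * Skeleton of a server-side filtering module registered at the Netfilter
 * LOCAL-IN and LOCAL-OUT hooks (Section V-A).  Illustrative only: the
 * original DWSR patch targets a 2.4-era kernel; this sketch uses the
 * current registration API and accepts every packet unmodified.
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <net/net_namespace.h>

static unsigned int dwsr_local_in(void *priv, struct sk_buff *skb,
                                  const struct nf_hook_state *state)
{
    /* Requests from the Web switch arrive here: rebuild the TCP state if it
     * is missing, otherwise adjust sequence/acknowledgment numbers before
     * handing the packet to the local stack. */
    return NF_ACCEPT;
}

static unsigned int dwsr_local_out(void *priv, struct sk_buff *skb,
                                   const struct nf_hook_state *state)
{
    /* Responses leave here: masquerade the sequence numbers and rewrite the
     * session cookie name before the packet goes directly to the client. */
    return NF_ACCEPT;
}

static struct nf_hook_ops dwsr_ops[] = {
    { .hook = dwsr_local_in,  .pf = NFPROTO_IPV4,
      .hooknum = NF_INET_LOCAL_IN,  .priority = NF_IP_PRI_FIRST },
    { .hook = dwsr_local_out, .pf = NFPROTO_IPV4,
      .hooknum = NF_INET_LOCAL_OUT, .priority = NF_IP_PRI_FIRST },
};

static int __init dwsr_filter_init(void)
{
    return nf_register_net_hooks(&init_net, dwsr_ops, ARRAY_SIZE(dwsr_ops));
}

static void __exit dwsr_filter_exit(void)
{
    nf_unregister_net_hooks(&init_net, dwsr_ops, ARRAY_SIZE(dwsr_ops));
}

module_init(dwsr_filter_init);
module_exit(dwsr_filter_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Sketch of a DWSR-style filtering module");
```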
B. Benchmark methodology

The DWSR architecture and several commercial two-way L7 Web switches are benchmarked, and their request rates are compared. These L7 Web switches also support a one-way architecture at layer 4, but not at layer 7, perhaps because of the difficulties discussed herein. Four powerful PCs with dual Athlon XP 1700+ processors and gigabit NICs are used, and WebBench 4.1 generates a huge amount of Web traffic. Appendix I lists the workload. It is a mix of GIF images, hypertext documents and the status code 404, "Not found" [1], and so represents a typical workload in a practical situation. Six PCs with 100 Mb/s NICs run Apache 1.3 as Web servers. Each server can service around 1,700 requests/s, and the response traffic is around 85 Mb/s.

C. Benchmark result

Figure 5 compares the request rate of the DWSR with those of the commercial products. The architecture competes favorably with the commercial products in terms of performance. Unlike the request rates of the commercial products, which are measured when the products are stressed to the limit of their capacity, the rate of 8,878 requests per second is not the upper limit achievable by the DWSR architecture. The scalability is almost linear in the number of servers, but the measurement is limited by the number of servers available for the experiment. DWSR delegates most tasks, such as TCP masquerading and connecting with clients, to the filtering modules on the servers, so the throughput is highly scalable. The time taken by the DWSR to process a request is 69.723 µs on an Intel Pentium III 1 GHz PC, so a request rate of around 14,285 requests per second is estimated to be achievable on that platform.

Figure 5: Comparison of the request rate of the DWSR with those of commercial products (requests/second): Nortel ACE switch 180e, 5,402; Cisco CSS 11154, 6,020; F5 BIG-IP HA+ Controller, 5,151; Foundry ServerIron XL, 2,840; Radware WSD, 2,763; DWSR, 8,878.

Other one-way architectures, such as those presented by Aron et al. [4] and Andreolini et al. [5], are similarly linearly scalable, with reported request rates from 6,000 to 9,000 requests per second. The workload and the platform differ in those experiments, so their numerical results are not comparable with ours; only the differences between their designs and the design herein are discussed.

D. Internal benchmark of the DWSR

The elapsed time of each function is also measured to identify the bottlenecks of the DWSR. A special register that increments by one every clock cycle, i.e. every nanosecond on a Pentium III 1 GHz CPU, is used for the measurement. Table 2 presents the results. For each request, the DWSR may have to create a new connection entry, or look up the corresponding entry if the connection has already been created, and then dispatch the request. This step takes an average of around 46 µs. Layer 7 processing takes around 24 µs, of which 43% is spent generating an RST segment, 40% looking up the service entry, and 17% parsing the file extension of the URL pattern. The content parser is a bottleneck if the entire content is parsed; searching for the URL pattern alone takes around 10 to 15 µs. Another time-consuming part is the generation of an RST segment, but this happens only when the selected server is changed. The bottleneck of the filtering module on the server is the rebuilding of the TCP state, which includes around 100 µs to replay the three-way handshake with the kernel. Enabling cookie persistence greatly increases the time because it involves searching for the cookie and extracting the destination server identifier from it, demanding heavy content processing. A good string matching algorithm is required to accelerate this processing.

DWSR Web switch | Time | Server filter | Time
IPVS processing | 46 µs | Rebuild TCP state | 101.39 µs
L7 processing | 24 µs | TCP masquerading | 3.54 µs
Enable cookie persistence | 12 µs | Enable cookie persistence | 4.47 µs

Table 2: Results of the internal benchmark

VI. Discussion and Conclusion

This work presents a one-way L7 Web switch architecture and explores its scalability. One-packet TCP state migration with pre-allocation and switch-back masquerading facilitate TCP state migration and connection persistence, respectively. Cookie name rewriting is an easy and powerful means of supporting session persistence when URL switching is enabled. These solutions are highly scalable and not difficult to deploy. Two serious bottlenecks of the DWSR are identified from the internal benchmark.
One is content rule matching, and the other is rebuilding the TCP state at the filtering module on the server. The former is a general issue in Web switching. The latter is less serious because the number of filtering modules scales with the number of servers; the delay does not increase as the number of servers grows.

Some topics must be studied further to build a fully functional Web switch. For instance, a better dispatching algorithm could increase the probability of correct pre-allocations and thereby improve the overall performance. Differentiated services could be provided along with load balancing. Further implementations should also support SSL identifiers for maintaining session persistence. The extra load that the filtering module places on the Web server also deserves further study. Our preliminary implementation is open source, as part of the Linux Virtual Server Project, and hopefully it will be further improved by the Internet community.

Appendix I. Workload in the experiments

Class | Percentage | Class | Percentage
CLASS_233.gif | 20 | CLASS_6040.htm | 14
CLASS_735.gif | 8 | CLASS_11426.htm | 16
CLASS_1522.gif | 12 | CLASS_22132.htm | 7
CLASS_2895.gif | 20 | CLASS_404 | 2

(CLASS_233.gif means a GIF image file of 233 kB, occupying 20% of the total workload. The other class names are defined similarly. CLASS_404 is the response with error code 404, which means the requested content is not found.)

References

[1] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee, "Hypertext Transfer Protocol - HTTP/1.1," IETF RFC 2616, June 1999.
[2] V. Cardellini, E. Casalicchio, M. Colajanni, and P. Yu, "The State of the Art in Locally Distributed Web-Server Systems," ACM Computing Surveys, vol. 34, no. 2, pp. 263-311, June 2002.
[3] V. Pai, M. Aron, G. Banga, M. Svendsen, P. Druschel, W. Zwaenepoel, and E. Nahum, "Locality-Aware Request Distribution in Cluster-based Network Servers," Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 205-216, San Jose, CA, Oct. 1998.
[4] M. Aron, P. Druschel, and W. Zwaenepoel, "Efficient Support for P-HTTP in Cluster-Based Web Servers," Proceedings of the USENIX Annual Technical Conference, Monterey, CA, June 1999.
[5] M. Andreolini, M. Colajanni, and M. Nuccio, "Scalability of Content-aware Server Switches for Cluster-based Web Information System," WWW 2003, Budapest, Hungary, May 2003.
[6] W. Tang, L. Cherkasova, L. Russell, and M. W. Mutka, "Modular TCP Handoff Design in STREAMS-Based TCP/IP Implementation," Proceedings of the First International Conference on Networking (ICN-2001), Colmar, France, July 2001.
[7] D. Kerdlapanan and A. Khunkitti, "Content-based Load Balancing with Multicast and TCP-handoff," Proceedings of the 2003 International Symposium on Circuits and Systems (ISCAS-2003), vol. 2, pp. 25-28, Bangkok, Thailand, May 2003.
[8] Resonate, "TCP Connection Hop," Resonate Technical White Paper, Apr. 2001.
[9] C. Chow, "Introduction to Linux-based Virtual Server and Content Switch," tutorial presented at PDCAT 2001, Taipei, Taiwan, June 2001, http://cs.uccs.edu/~chow/pub/conf/pdcat/scalableRD.html.
[10] Linux Virtual Server Project, http://www.linuxvirtualserver.org.
[11] The Netfilter Project, http://www.netfilter.org/.
[12] DWSR source code, http://speed.cis.nctu.edu.tw/~motse/dwsr.htm.
[13] M. Aron, D. Sanders, P. Druschel, and W. Zwaenepoel, "Scalable Content-aware Request Distribution in Cluster-based Network Servers," Proceedings of the 2000 USENIX Annual Technical Conference, San Diego, CA, June 2000.