Challenges of Existing Designs
Three-tier designs face many challenges in scaling to meet the demands of today’s web applications. Traffic patterns in the data center have changed. Traditionally, the bulk of the traffic was north-south, meaning from the Internet to the data center and from the data center to the Internet. This affected where the firewalls were placed and how VLANs were designed. With the continued growth in data and storage and with the introduction of virtualization and three-tier web/app/database architectures, traffic is shifting toward an east-west pattern that challenges the three-tier design. This presents the following challenges:
Oversubscription between the tiers
Large flat L2 networks with stretched VLANs
Traffic hopping between tiers, inducing latency
Complexity of mechanisms used for IPv4 address scarcity
Flooding of broadcast, unknown unicast, and multicast (BUM) traffic
Loop prevention via spanning tree
Firewall overload
These issues are described next.
Oversubscription Between the Tiers
One of the challenges of the three-tier architecture is oversubscription between the tiers. For example, 20 servers can be connected to an access switch via 1 GE interfaces, while the access switch is connected to the aggregation switch via a 10 GE interface. With 20 Gbps of server-facing capacity feeding a 10 Gbps uplink, this constitutes a 2:1 oversubscription between the access and the aggregation layer. Traditionally, this created no problems because most of the traffic was north-south toward the Internet. It was therefore assumed that traffic was limited by the Internet WAN link and that oversubscription did not matter because there was ample bandwidth in the LAN. With the shift to east-west traffic, the bulk of traffic is now between the servers within the data center, so oversubscription between the access layer and the aggregation layer becomes a problem. Access layer switches are normally dual homed into two aggregation layer switches. As servers are added, more access switches need to be deployed to accommodate them. In turn, more aggregation switches must be added to accommodate the access switches. Usually this is done in pods, where you duplicate the setup repeatedly. This is seen in Figure 1-3.
Figure 1-3 The Challenge of Scaling Three-Tier Designs
As you see, the addition of aggregation switches shifts the problem of oversubscription from access and aggregation to aggregation and core.
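The arithmetic behind an oversubscription ratio can be sketched in a few lines of Python. The access-layer numbers come from the example above; the aggregation-layer numbers and the function name are assumptions chosen only to illustrate how the problem moves up a tier.

```python
def oversubscription_ratio(downlink_count, downlink_gbps, uplink_count, uplink_gbps):
    """Return downstream capacity divided by upstream capacity."""
    downstream = downlink_count * downlink_gbps
    upstream = uplink_count * uplink_gbps
    return downstream / upstream


# Access layer from the text: 20 servers at 1 GE, one 10 GE uplink.
access_ratio = oversubscription_ratio(20, 1, 1, 10)
print(f"access to aggregation: {access_ratio:g}:1")  # prints 2:1

# Hypothetical aggregation switch (assumed numbers, not from the text):
# 8 access switches each sending 10 GE up, 2 x 10 GE toward the core.
agg_ratio = oversubscription_ratio(8, 10, 2, 10)
print(f"aggregation to core: {agg_ratio:g}:1")  # prints 4:1
```

Under these assumed numbers, adding access switches to a pod raises the aggregation-to-core ratio even though each access switch keeps its own 2:1 ratio.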
Large Flat L2 Networks with Stretched VLANs
With multitier applications, traffic moves around between web servers, application servers, and databases. Also, with the introduction of server virtualization, virtual machines move around between different servers. When virtual machines move around, they have to maintain their IP addresses to maintain connectivity with their clients, so this movement happens within the same IP subnet. Because the virtual machines could land anywhere in the data center, and because the IP subnet is tied to the VLAN, VLANs must be stretched across the whole data center. Every access and aggregation switch must be configured with all VLANs, and every server NIC must see traffic of all VLANs. This enlarges the flat L2 domain and causes inefficiencies as broadcast packets end up touching every server and virtual machine in the data center. Mechanisms to limit L2 domains and flooding must be implemented to scale today’s data centers.
Traffic Hopping Between Tiers, Inducing Latency
Latency is introduced every time traffic crosses a switch or a router before it reaches the destination. The traffic path between nodes in the data center depends on whether the traffic is exchanged within the same VLAN (L2 switched) or exchanged between VLANs (L3 switched/routed).
Traffic exchanged within the same VLAN or subnet is normally switched at L2, whereas traffic exchanged between VLANs must cross an L3 boundary. Notice the following in Figure 1-4:
Intra-VLAN east-west traffic between W1, AP1, and DB1 in VLAN 100 is L2 switched at the access layer. All are connected to switch 1 (SW1) and switch 2 (SW2), and traffic is switched within those switches depending on what ports are blocked or unblocked by spanning tree.
Intra-VLAN east-west traffic between W2, AP2, and DB2 in VLAN 200 is L2 switched at the access layer. All are connected to switch 2 (SW2) and switch 3 (SW3), and traffic is switched within those switches depending on what ports are blocked or unblocked by spanning tree.
Inter-VLAN east-west traffic between W1, AP1, and DB1 and W2, AP2, and DB2 goes to the aggregation layer to be L3 switched because traffic is crossing VLAN boundaries.
Figure 1-4 Traffic Hopping Between Tiers
Every time the traffic crosses a tier, latency is introduced, especially if the network is heavily oversubscribed. That’s why it is beneficial to minimize the number of tiers and have traffic switched/routed without crossing many tiers. Legacy aggregation switches used to work at layer 2 and offload layer 3 to the L3 core; however, the latest aggregation switches support L2/L3 functions and route traffic between VLANs via mechanisms such as a switch virtual interface (SVI), which is described next.
Inter-VLAN Routing via SVI
An SVI is a logical interface within an L3 switch that doesn’t belong to any physical port. Each SVI is associated with a specific VLAN. L3 switches perform IP switching/routing between VLANs. Think of them as logical routers that live within the switch and that have their connected SVIs associated with the VLANs. This allows inter-VLAN routing on an L3 switch (see Figure 1-5).
Figure 1-5 Inter-VLAN Routing via SVI
As shown earlier, traffic between VLAN 100 (subnet 10.0.1.0/24) and VLAN 200 (subnet 10.0.2.0/24) was L3 switched at the aggregation layer. To do so, two SVIs must be defined: an SVI for VLAN 100 with IP address 10.0.1.100, and an SVI for VLAN 200 with IP address 10.0.2.200. When routing is enabled on the L3 switch, traffic between the two subnets is routed. Servers within VLAN 100 use SVI 10.0.1.100 as their default gateway, and servers within VLAN 200 use SVI 10.0.2.200 as their default gateway.
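The decision between L2 switching and routing through an SVI can be illustrated with a minimal Python sketch. It uses the VLANs and SVI addresses from the example; the dictionary layout and the function names are assumptions for illustration and do not represent any switch's configuration model.

```python
import ipaddress

# SVI addresses from the text; the dictionary layout is an assumption.
SVIS = {
    100: ipaddress.ip_interface("10.0.1.100/24"),
    200: ipaddress.ip_interface("10.0.2.200/24"),
}


def vlan_for(address):
    """Return the VLAN whose SVI subnet contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for vlan, svi in SVIS.items():
        if ip in svi.network:
            return vlan
    return None


def forwarding_path(src, dst):
    """Describe how traffic from src to dst would be forwarded."""
    src_vlan, dst_vlan = vlan_for(src), vlan_for(dst)
    if src_vlan is None or dst_vlan is None:
        return "no SVI for one of the subnets"
    if src_vlan == dst_vlan:
        return f"L2 switched inside VLAN {src_vlan}"
    gateway = SVIS[src_vlan].ip  # the sender's default gateway
    return f"routed via SVI {gateway} from VLAN {src_vlan} to VLAN {dst_vlan}"


print(forwarding_path("10.0.1.1", "10.0.1.2"))
print(forwarding_path("10.0.1.1", "10.0.2.1"))
```

The first call stays inside VLAN 100, while the second is handed to the sender's default gateway, SVI 10.0.1.100, and routed into VLAN 200.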
Complexity of Mechanisms Used for IPv4 Address Scarcity
Defining many subnets in a data center easily consumes the IP space that is allocated to an enterprise. IPv4 subnetting is a complex topic, and this chapter does not go into detail, but if you are starting to get lost with what 10.0.1.0/24 represents, here is a quick review.
The notation 10.0.1.0/24 is classless interdomain routing (CIDR) notation. An IP address is 32 bits, written as a.b.c.d, where a, b, c, and d are 8 bits each. The /24 indicates that you are splitting the IP address, left to right, into a 24-bit network address and an 8-bit host address. A /24 means a subnet mask of 255.255.255.0. Therefore, 10.0.1.0/24 means that the network is 10.0.1 (24 bits) and the hosts take the last 8 bits. With 8 bits, you have 2 to the power of 8 (2^8 = 256) addresses, but you lose addresses 0 and 255, which are reserved for the network address and the broadcast address, respectively, and you end up with 254 hosts. 10.0.1.1 inside subnet 10.0.1.0/24 indicates host 1 inside subnet 10.0.1. You can try on your own what 10.0.1.0/27 results in.
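One way to check this kind of arithmetic is with Python's standard ipaddress module. The sketch below is only an illustration; the helper name describe is an assumption, and the usable count subtracts the network and broadcast addresses of the subnet.

```python
import ipaddress


def describe(cidr):
    """Summarize a subnet given in CIDR notation."""
    net = ipaddress.ip_network(cidr)
    return {
        "netmask": str(net.netmask),
        "network": str(net.network_address),
        "broadcast": str(net.broadcast_address),
        "total": net.num_addresses,
        # The network and broadcast addresses cannot be assigned to hosts.
        "usable": net.num_addresses - 2,
    }


print(describe("10.0.1.0/24"))
print(describe("10.0.1.0/27"))
```

For 10.0.1.0/24 this reports 256 addresses, of which 254 are usable. For 10.0.1.0/27 it reports a mask of 255.255.255.224 and 32 addresses, of which 30 are usable.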
IPv4 addresses are becoming scarce. There is a shift toward adopting IPv6 addressing, which provides a much bigger IP address space. However, so far not everyone is courageous enough to dabble into IPv6, and many enterprises still use IPv4 and mechanisms such as NAT. NAT allows the enterprise to use private IP addresses inside the enterprise and map to whatever public IP addresses they are allocated from a provider.
If a provider allocates the enterprise 128.0.1.224/27 (subnet mask 255.255.255.224), for example, the enterprise has practically one subnet, 128.0.1.224, with 32 (2^5) addresses. You lose the first address (128.0.1.224, the network address) and the last address (128.0.1.255, the broadcast address), so the usable range is from 128.0.1.225 to 128.0.1.254. Therefore, the enterprise has at most 30 public IP addresses inside one subnet.
If the enterprise chooses to divide the network into more subnets, it can use /29 to split the /27 range into 4 subnets (29−27 = 2, and 2^2 = 4) with 8 addresses each (32−29 = 3, and 2^3 = 8). The subnets are 128.0.1.224, 128.0.1.232, 128.0.1.240, and 128.0.1.248. Each subnet has 8 addresses, but every subnet loses its own network and broadcast addresses, leaving 6 usable hosts per subnet. Across the four subnets, you practically lose 4×2 = 8 IP addresses in the process, which leaves 24 usable addresses.
As mentioned, most implementations today use private IP addresses internally and map these IP addresses to public IPs using NAT. The Internet Assigned Numbers Authority (IANA) reserved the following three CIDR blocks for private IP addresses: 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16. With the 128.0.1.224/27 allocation, an enterprise saves the 30 public addresses for accessing its servers publicly and uses as many private IP addresses and subnets internally as it needs.
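The subnetting steps can be verified with a short sketch using Python's standard ipaddress module. It assumes the /27 block that begins at 128.0.1.224, which matches the usable range 128.0.1.225 to 128.0.1.254 given above; the helper name is_private_block is an assumption for illustration.

```python
import ipaddress

# Assumed public block: the /27 that starts at 128.0.1.224.
block = ipaddress.ip_network("128.0.1.224/27")
hosts = list(block.hosts())  # excludes the network and broadcast addresses
print(hosts[0], hosts[-1], len(hosts))  # 128.0.1.225 128.0.1.254 30

# Split the /27 into /29 subnets and list the usable range of each.
for subnet in block.subnets(new_prefix=29):
    usable = list(subnet.hosts())
    print(subnet, usable[0], usable[-1], len(usable))

# The three private blocks named in the text.
PRIVATE_BLOCKS = [
    ipaddress.ip_network(b)
    for b in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_private_block(address):
    """Return True if the address falls inside one of the three private blocks."""
    ip = ipaddress.ip_address(address)
    return any(ip in b for b in PRIVATE_BLOCKS)


print(is_private_block("10.0.1.1"), is_private_block("128.0.1.225"))  # True False
```

Each /29 reports 6 usable addresses, so the four subnets together provide 24 usable addresses instead of the 30 available in the undivided /27.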
However, early designs used what is called private VLANs to save on the IP space. Although this method is not straightforward and needs a lot of maintenance, it is still out there. This topic is covered briefly because it is revisited later in the book when discussing endpoint groups (EPGs) in Chapter 17, “Application-Centric Infrastructure.”
Private VLANs
This is a mechanism that allows one VLAN, called a primary VLAN (PVLAN), to be split into multiple sub-VLANs, called secondary community VLANs. The IP address space within the PVLAN can now be spread over multiple secondary VLANs, which can be isolated from each other. In a way, this removes the restriction of having one subnet per VLAN, which gives more flexibility in IP address assignment. In the traditional north-south type of traffic, this allows the same subnet to be segmented into sub-VLANs, where each sub-VLAN is its own broadcast domain, and flooding is limited. The secondary community VLANs allow hosts to talk to one another within the community. Any host from one secondary community VLAN that needs to talk to another secondary community VLAN must go through an L3 router for inter-VLAN routing. This is illustrated in Figure 1-6.
Figure 1-6 Primary and Secondary VLANs
The switch ports that connect to the hosts are community switch ports, whereas switch ports that connect to the router are called promiscuous ports. Ports within the same community talk to each other and to the promiscuous port. Ports between communities must go through the router. Note that hosts 10.0.1.3 and 10.0.1.4 belong to secondary community VLAN 1100, and hosts 10.0.1.5 and 10.0.1.6 belong to secondary community VLAN 1200. All of these hosts belong to the same IP subnet, but they are segmented by the secondary VLANs. So, although 10.0.1.4 talks directly to 10.0.1.3 through intra-VLAN switching, it needs to go through an L3 router to reach 10.0.1.5 or 10.0.1.6 through inter-VLAN routing.
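The reachability rule for community VLANs can be expressed as a small Python sketch. The hosts and VLAN numbers are those of the example; the dictionary and the function name are assumptions for illustration and do not reflect any vendor's private VLAN configuration syntax.

```python
# Hosts and secondary community VLANs from the text; the layout is an assumption.
COMMUNITY_VLAN = {
    "10.0.1.3": 1100,
    "10.0.1.4": 1100,
    "10.0.1.5": 1200,
    "10.0.1.6": 1200,
}


def path_between(src, dst):
    """Return how two hosts in the same primary VLAN reach each other."""
    if COMMUNITY_VLAN[src] == COMMUNITY_VLAN[dst]:
        return "direct (intra-VLAN switching)"
    # Different communities meet only at the promiscuous port, so the router is the path.
    return "via L3 router (inter-VLAN routing)"


print(path_between("10.0.1.4", "10.0.1.3"))
print(path_between("10.0.1.4", "10.0.1.5"))
```

Although all four hosts share the subnet 10.0.1.0/24, only hosts in the same community are switched directly; the second call goes through the router.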
The discussed example is simplistic. As you add multiple access switches and aggregation switches, you need to decide whether L3 routing is done in the aggregation layer or in the core layer. Also, as more and more sub-VLANs are created, such VLANs need to be carried across the network and maintained on every switch. Although these methods give flexibility in decoupling subnets from VLANs, they also add extra overhead. When Cisco’s implementation of EPGs is discussed, the decoupling of IP subnets from VLANs will become much clearer, and the whole deployment model will become automated as complexities are hidden from the end user.
Flooding of Broadcast, Unknown Unicast, and Multicast (BUM) Traffic
A main issue of L2 networks is flooding of BUM traffic. Each station needs an IP address–to–MAC address mapping to send traffic to a certain IP address on a LAN. If a source station is trying to reach a destination station and does not know its IP-to-MAC address mapping, it sends an Address Resolution Protocol (ARP) request to the broadcast address. If a VLAN is configured, the switch floods the broadcast to all switch ports that belong to that VLAN. Whenever a device sees an ARP request for its own IP address, it responds to the source station with its MAC address. The source station that sent the original ARP request stores the IP-to-MAC address mapping in its ARP table, and from then on it uses the MAC address it learned to send traffic to the destination station. This is seen in Figure 1-7.
Figure 1-7 ARP Flooding
As seen in Figure 1-7, server 1 (S1) with IP 10.0.1.1 must send traffic to server 2 (S2) with IP 10.0.1.2, which is in the same subnet. The switch is configured to have subnet 10.0.1.0/24 mapped to VLAN 100. If S1 does not know the MAC address of S2, it sends an ARP request with a broadcast MAC address of ffff.ffff.ffff. Because the destination is the broadcast address, the switch floods the ARP request to all the ports that belong to VLAN 100. Once S2 sees the ARP request, it responds with its MAC address 0000.0c02.efgh directly to S1 (0000.0c01.abcd). After that, S1 and S2 update their ARP tables, and the switch records in its MAC address table the ports on which it saw each MAC address.
An ARP broadcast is one type of packet that is flooded. Other types are unknown unicast and multicast. Say, for example, that S1 knows the MAC address of S2 and sends a packet to 0000.0c02.efgh; however, the switch has flushed its MAC address table and no longer knows which port leads to S2. In that case, the switch floods that packet to all of its ports on VLAN 100. When S2 replies, the switch learns the port that S2 is on and updates its table.
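The flooding behavior described above can be modeled with a minimal learning-switch sketch in Python. The class, the port numbers, and the single-VLAN simplification are assumptions for illustration; the MAC strings are the placeholders from the example, and the table the switch keeps is modeled as a MAC-address-to-port dictionary.

```python
BROADCAST = "ffff.ffff.ffff"
S1_MAC = "0000.0c01.abcd"  # placeholder from the text
S2_MAC = "0000.0c02.efgh"  # placeholder from the text


class LearningSwitch:
    """A single-VLAN switch that learns source MACs and floods when unsure."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}  # MAC address -> port

    def receive(self, in_port, src_mac, dst_mac):
        """Learn the source, then return the set of ports the frame is sent out of."""
        self.mac_table[src_mac] = in_port
        if dst_mac != BROADCAST and dst_mac in self.mac_table:
            return {self.mac_table[dst_mac]}  # known unicast: one port
        return self.ports - {in_port}  # broadcast or unknown unicast: flood


# Assumed port layout: S1 on port 1, S2 on port 2, other hosts on ports 3 and 4.
sw = LearningSwitch([1, 2, 3, 4])
print(sw.receive(1, S1_MAC, BROADCAST))  # ARP request is flooded: {2, 3, 4}
print(sw.receive(2, S2_MAC, S1_MAC))     # reply goes only to S1: {1}
print(sw.receive(1, S1_MAC, S2_MAC))     # known unicast: {2}

sw.mac_table.clear()                     # the switch flushes its table
print(sw.receive(1, S1_MAC, S2_MAC))     # unknown unicast is flooded: {2, 3, 4}
```

The last call shows the unknown unicast case: S1 still knows the MAC address of S2, but the switch has to flood because it no longer knows the port.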
Flooding consumes bandwidth, and many measures are taken to limit it. Flooding also makes broadcast storms possible when the network contains loops, as discussed next.
Loop Prevention Via Spanning Tree
The problem with L2 networks is the potential for broadcast storms when loops exist. Say that an ARP packet is flooded over all switch ports, and that packet finds its way back to the switch that flooded it because of a loop in the network. That broadcast packet circulates in the network forever. The Spanning Tree Protocol (STP) ensures that loops do not occur by blocking ports that contribute to the loop. This means that although some ports are active and passing traffic, others are blocked and not used. This is seen in Figure 1-8. Note that an ARP packet that is flooded by SW1 and then flooded by SW4 could return to SW1 and circulate in a loop. To avoid this situation, spanning tree blocks the redundant paths to prevent loops from occurring and hence prevent potential broadcast storms.
Figure 1-8 Loop Prevention Via Spanning Tree
The drawback is that expensive resources such as high-speed interfaces remain idle and unused. More efficient designs use every link and resource in the network.
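The effect of blocking redundant links can be illustrated with a short Python sketch that builds a loop-free tree using a breadth-first search. This is not the actual Spanning Tree Protocol, which elects a root bridge and compares path costs through an exchange of BPDUs. The four-switch ring and the choice of SW1 as root are assumptions and may not match the topology in Figure 1-8.

```python
from collections import deque


def spanning_tree(links, root):
    """Return (forwarding, blocked) link sets from a breadth-first search at root."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    visited = {root}
    forwarding = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for peer in sorted(neighbors[node]):
            if peer not in visited:
                visited.add(peer)
                forwarding.add(frozenset((node, peer)))  # first path to peer stays active
                queue.append(peer)

    # Any link that did not make it into the tree would close a loop, so it is blocked.
    blocked = {frozenset(link) for link in links} - forwarding
    return forwarding, blocked


# Hypothetical ring of four switches; SW1 is assumed to be the root.
links = [("SW1", "SW2"), ("SW1", "SW3"), ("SW2", "SW4"), ("SW3", "SW4")]
forwarding, blocked = spanning_tree(links, "SW1")
print(sorted(tuple(sorted(link)) for link in blocked))  # [('SW3', 'SW4')]
```

Three of the four links stay active and one is blocked, which is the idle capacity that the drawback above refers to.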
Firewall Overload
Another issue with the existing three-tier design is that the firewalls that are traditionally connected to the aggregation layer become a catch-all for traffic. These firewalls were originally meant to enforce policies on north-south traffic between the Internet and the data center. Because east-west traffic between the multiple application tiers has increased dramatically, securing the data center from the inside is now essential. As such, policy enforcement for east-west traffic now must go through the same firewalls. This is seen in Figure 1-9. With practically all traffic going to the firewall, the firewall rule sets become enormous. Normally, administrators define such policies for a specific service, and when the service disappears, the policies remain in the firewall. Anyone who has worked with ACLs and firewall rules knows that nobody dares to touch or delete a rule from the firewall for fear of breaking traffic or affecting an application. With the move to hyperconverged infrastructures and a two-tier design, policy enforcement for applications shifts from being VLAN centric to application centric. Cisco’s ACI and VMware’s networking and security software product (NSX), covered later in this book, show how policies are enforced in a hyperconverged environment.
Figure 1-9 Firewall Overload
With so many issues in legacy data centers, newer data center designs are making modifications, including these:
Moving away from the three-tier architecture to a two-tier leaf and spine architecture
Moving L3 into the access layer and adopting L2 tunneling over L3 networks
Implementing link aggregation technologies such as virtual port channel (vPC) and multichassis link aggregation (MLAG) to offer greater efficiency by making all links in the network active; traffic is load-balanced between different switches
Moving firewall policies for east-west traffic from the aggregation layer to the access layer or directly attaching them to the application
The preceding enhancements and more are discussed in detail in Part VI of this book, “Hyperconverged Networking.”