Archive for May, 2008

The Network is Down? How did that happen?

May 21, 2008

You’ve designed a multi-path, fault-tolerant network.  You think you’re protected, right?

Not necessarily.

I will sheepishly admit that I got caught many years ago (before PathSolutions) when I designed a highly reliable network and thought that it was bulletproof.

I ran a network with three buildings in a campus, each one on its own subnet with links between each building.  I figured if one link went down, the traffic would be carried by the alternate link until we could diagnose and fix the outage.

I set up the monitoring software to ping various devices to make sure that they responded, and if a response failed, it should alert someone.

Things went well for a few months.

One morning, building 1 lost connectivity to the other buildings: a complete outage.  I looked at both building 1 links and discovered that one link had lost its connectivity just a few minutes earlier, and the alternate link had lost its connectivity over 2 weeks earlier.

How could this happen?  I thought I had monitoring in place to make sure this would never happen.

Using PING (ICMP echo) to test network connectivity is fine when you only have one path to a destination.  If that path is lost, you’ll get an alert.

If you have multiple routes to reach a destination, the PINGs will still go through even if the primary (or backup) route is down.  Thus, you’ll only get the notification if BOTH links go down.

This is the point where I learned that you need a better way to track this: Monitor the actual interface status and alert based on its status change.
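To make the difference concrete, here is a minimal sketch of alerting on interface status changes rather than on ping reachability.  It assumes the status values have already been collected (for example by polling ifOperStatus from the standard IF-MIB over SNMP); the link names and the two polls are hypothetical, and the collection step itself is not shown.

```python
# ifOperStatus values as defined in the standard IF-MIB (RFC 2863).
IF_OPER_STATUS = {
    1: "up", 2: "down", 3: "testing", 4: "unknown",
    5: "dormant", 6: "notPresent", 7: "lowerLayerDown",
}


def status_changes(previous, current):
    """Return (interface, old, new) for each interface whose status changed."""
    changes = []
    for name, new in sorted(current.items()):
        old = previous.get(name)
        if old is not None and old != new:
            changes.append((
                name,
                IF_OPER_STATUS.get(old, "unknown"),
                IF_OPER_STATUS.get(new, "unknown"),
            ))
    return changes


def ping_succeeds(link_status):
    """A ping-style check passes as long as at least one path is up."""
    return any(status == 1 for status in link_status.values())


# Hypothetical polls of building 1's two links (names are illustrative).
poll_1 = {"bldg1-to-bldg2": 1, "bldg1-to-bldg3": 1}
poll_2 = {"bldg1-to-bldg2": 1, "bldg1-to-bldg3": 2}  # one link has failed

for name, old, new in status_changes(poll_1, poll_2):
    print(f"ALERT: {name} changed from {old} to {new}")
print("Ping check still passes:", ping_succeeds(poll_2))
```

The ping-style check keeps passing after the first link fails, while the status comparison raises an alert right away, which is exactly the two weeks of warning I did not get.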

Embarrassment avoided!

Why is the Network so SLOW?

May 15, 2008

This is the age-old complaint of users everywhere.

Network administrators struggle to answer the question because they typically don’t have the proper tools deployed.  It’s also one of the biggest impacts on business productivity, yet the problem typically goes unsolved.

The causes of slowdowns come from the following areas:

  • Over-subscribed network link
  • High levels of packet loss on a link
    • Misconfiguration (duplex mismatch, collisions, incorrect QoS)
    • Poor cabling (RFI noise, CAT3 cabling used where CAT5e is required)
    • Hardware fault (damaged/broken interface)
    • High utilization levels can also lead to packet loss
  • Slow responding server

When a slowdown occurs, most network engineers immediately look at their WAN links to see if they are over-utilized.  They may employ a sniffer or packet analyzer to see if there is a flood of packets from any specific location to any other location, trying to make some sense of a huge pile of packets.

This attempt typically results in “guessing” at the problem, or worse yet: recommending upgrades where facts have not been obtained.

The way to accurately determine why the network is slow is to continuously monitor all of the links in the network and determine which ones are over-utilized and/or are discarding packets.  That eliminates the guesswork, and provides the needed facts of where the problem is, and why it’s occurring.
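As a rough sketch of what that monitoring computes for each link, the example below turns two successive counter readings into a utilization percentage and a discard count.  It assumes 32-bit octet and discard counters like those in the standard IF-MIB; the dictionary keys, the 80% threshold, and the sample readings are my own illustrative choices, not values from any particular product.

```python
COUNTER32_MAX = 2 ** 32  # 32-bit interface counters wrap at this value


def counter_delta(old, new, modulus=COUNTER32_MAX):
    """Difference between two counter readings, allowing for one wrap."""
    return (new - old) % modulus


def link_report(old, new, interval_s, speed_bps, util_threshold=80.0):
    """Summarize one link from two polls taken interval_s seconds apart."""
    octets = counter_delta(old["in_octets"], new["in_octets"])
    discards = counter_delta(old["in_discards"], new["in_discards"])
    # Utilization = bits received / bits the link could have carried.
    utilization = (octets * 8) / (interval_s * speed_bps) * 100
    return {
        "utilization_pct": round(utilization, 1),
        "discards": discards,
        "over_utilized": utilization > util_threshold,
        "discarding": discards > 0,
    }


# Hypothetical readings from a 10 Mbps link, polled 300 seconds apart.
# The octet counter wraps between the two polls.
first = {"in_octets": 4_200_000_000, "in_discards": 10}
second = {"in_octets": 242_532_704, "in_discards": 25}

report = link_report(first, second, interval_s=300, speed_bps=10_000_000)
print(report)
```

Running the same calculation on every link at every poll is what turns “the network is slow” into a specific link with a specific cause.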

Why do Duplex Mismatches Occur?

May 2, 2008

This is a really old story, but it needs telling.

10Base-T was originally designed to be a half-duplex medium, where everyone shared the 10 Mbps bandwidth.  You would plug a number of computers into a hub and know that CSMA/CD (Carrier Sense Multiple Access with Collision Detection) would handle any resource contention.

Collisions were part of the design, and were an accepted norm.

In 1989, a company called Kalpana looked at this situation and figured that collisions could be dramatically reduced if the “hub” was able to act as an Ethernet bridge instead.  They created the first multi-port bridge and decided to name it something completely different: an Ethernet Switch.

The name caught on, and many companies decided they wanted to reduce the number of collisions they were having on their networks.

Switches became very popular as a network upgrade because no configuration was required — just unplug the hub and drop in a switch and BAM!  Instant upgrade.

Along the way, the Ethernet card and switch manufacturers realized that something interesting could be done with the medium: Both sides could be configured to talk at the same time and collisions could be completely eliminated.  This “full duplex” mode of communication sped things up even further.

At that time, the IEEE (the standards body that maintains the Ethernet specifications) did not include support for this full-duplex mode, so the companies that wanted to offer this capability each implemented it in their own way, and not always well.

Duplex had to be manually set on adapters as well as switches.  It was tedious to make sure that all connections were set properly, so companies responded to this with an “auto-configure” of duplex for the adapters and switch interfaces.

The auto-detection of interface speed (10 Mbps, 100 Mbps, 1000 Mbps) was easy to accomplish because the interface could sense the frequency that was being transmitted from the other side.

Auto-detection of the duplex setting of the remote side was quite complicated.  The remote side would transmit a carrier signal superimposed on top of the 10 Mbps or 100 Mbps Ethernet signal and hope that the receiving side would be able to properly “read” the signal.

The problem with this is that many adapters saw the superimposed signal as “line noise” and would filter it out rather than attempt to interpret it.

Since there were no standards defined to guide companies, problems occurred.

The switch port would “auto-configure” for full-duplex, and the NIC would default to half-duplex after it failed to read the superimposed signal.

This created an ugly situation: One side would transmit blindly thinking that it was safe to do so, and the other side would see massive collisions and CRC errors (because it was still attempting to respect the CSMA/CD ruleset).
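The simplest way to catch this is to compare the duplex setting each end of a link actually ended up with.  Here is a minimal sketch, assuming those settings have already been gathered from the switch and the NIC; the port names and the record layout are hypothetical.

```python
def find_duplex_mismatches(links):
    """Return the names of links whose two ends disagree on duplex."""
    mismatches = []
    for link in links:
        if link["switch_duplex"] != link["nic_duplex"]:
            mismatches.append(link["name"])
    return mismatches


# Hypothetical inventory: the duplex each end settled on after negotiation.
links = [
    {"name": "port-11", "switch_duplex": "full", "nic_duplex": "full"},
    {"name": "port-12", "switch_duplex": "full", "nic_duplex": "half"},
    {"name": "port-13", "switch_duplex": "half", "nic_duplex": "half"},
]

for name in find_duplex_mismatches(links):
    print(f"Duplex mismatch on {name}")
```

Here port-12 is the case described above: the switch port chose full-duplex and the NIC fell back to half-duplex.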

The industry has not yet solved the problem (their solution is to recommend upgrading to Gigabit switches everywhere — Gigabit is always full-duplex and doesn’t have these problems!)

Gigabit isn’t the answer in many cases, as there’s currently no need for Gigabit to extend to the desktop except in rare cases (CAD design stations for example).

Another problem is that many offices are deploying VoIP phones that require Power over Ethernet (PoE).  Gigabit Ethernet requires four pairs of wire to communicate.  This means that gigabit and PoE are incompatible because there are only four pairs of wire in standard cable systems (10/100 only requires two pairs, and PoE requires two pairs so they can both co-exist in one cable).

Switch manufacturers have added support for PoE in Gigabit environments by adding “phantom power” to two pairs that also carry data signal.  This works well in resolving the problem of having PoE co-exist with Gigabit to the desktop, but these switches are currently quite pricey.

This means that 10/100 Ethernet is destined to remain in use for the foreseeable future, and duplex mismatches will also remain a problem in networks.