Have More Eureka Moments in your Life

September 3, 2009

Have you ever researched a problem for hours on end, finally stumbled across its source, screamed “Eureka!”, and quickly solved it?  Remember how good you felt as a result?

I’d be willing to bet that you’d want to have more of those moments (but without the hours of painstaking research, and without your users or boss looming over you, impatiently waiting for a solution), right?

What if I told you that you could have complete visibility into your entire network — not just a few core links, and not just utilization and an “errors” graph, but complete knowledge of when network faults occur, where they occur, and exactly why they occur?  My guess is that you’d become addicted to such a solution because you could solve a ton of problems in your network and look and feel like the expert that you are.

Total Network Visibility is the ability to know when network faults occur in your network, and to quickly locate where and why they happened.

Your next “Eureka” moment is just a few minutes away…

IT Cost Savings Through Bloatware Elimination

August 21, 2009

Compare the following two software scenarios:

Software package #1: A 6 MB download that requires a Pentium 200 MHz system with 64 MB of RAM and 500 MB of hard disk space.  It runs well on any Windows OS, with any service pack, and doesn’t require any additional licensing or external packages to accomplish its goal.  It will work well on a shared server or virtual server.

Software package #2: A 400 MB download that requires a dual-core 2 GHz system with 2 GB of RAM and 2 GB of hard disk space.  It must run under Windows Server 2003 or 2008, and it also requires a database license (with the database potentially installed on a secondary server due to loading).  Virtualization is possible but not recommended, because its design leaves it constantly hungry for more resources.  Applying service pack updates and/or .NET library updates may require this package to have a patch applied.  Database maintenance and patching must also be performed to ensure security and continued operation.

Which scenario costs more in terms of:

  • Server footprint?
  • Deployment cost?
  • Ongoing engineering support cost?
  • Air conditioning & power costs?

Let’s add some more details:

Software package #1 scales to monitor a network of 30,000 nodes with a single deployment, and collects 27 different data elements for each monitored interface.

Software package #2 is designed to monitor 8,000 nodes with a single deployment, and collects 13 different data elements for each monitored interface.

Now which software package do you want to run on your network?

A New Spin on NetFlow

August 8, 2009

NetFlow as a technology originally seemed really cool: You could identify who’s hogging the bandwidth and then go clobber them.

The problem is that NetFlow (and all of the other flow protocols: J-Flow, sFlow, etc.) is an “after the fact” reporting mechanism:  The router only transmits a flow record after the 40 GB download has completed (2 hours after the network slowdown started).

Some products claim to show “real-time” NetFlow records, but they’re only showing you the records as they arrive in their system (still after the fact).

Thus, when the network slows down, network engineers still have to set up analyzers and span ports to hunt down who’s stealing the bandwidth and what they are doing.  This process takes a skilled engineer at least 10 minutes to set up and start analyzing packets.

Since the problem of “why is the network slow” isn’t solved with existing NetFlow implementations, I figured there had to be a better way.  With a bit of poking, prodding, and painstaking research, I discovered that there WAS a better way: Read the LIVE flow table INSIDE the router!
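
To make the idea concrete, here’s a rough sketch (not the actual SwitchMonitor code) of polling a live flow cache.  It assumes a Cisco IOS router that supports “show ip cache flow” and is reachable over SSH with the paramiko library; the hostname, credentials, and the simple parsing are placeholders, since the output format varies by platform and IOS version:

```python
# Rough sketch only: poll a router's live NetFlow cache over SSH and report
# the largest active flows.  Hostname, credentials, and the parsing below are
# illustrative placeholders.
import paramiko

def top_live_flows(host, username, password, count=10):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=username, password=password, look_for_keys=False)
    try:
        _, stdout, _ = client.exec_command("show ip cache flow")
        lines = stdout.read().decode(errors="replace").splitlines()
    finally:
        client.close()

    flows = []
    for line in lines:
        fields = line.split()
        # Flow rows end in a packet count; header and summary rows do not.
        if len(fields) >= 8 and fields[-1].isdigit():
            src_if, src_ip, dst_if, dst_ip = fields[0], fields[1], fields[2], fields[3]
            flows.append((int(fields[-1]), src_ip, dst_ip, src_if, dst_if))

    return sorted(flows, reverse=True)[:count]

if __name__ == "__main__":
    for pkts, src, dst, src_if, dst_if in top_live_flows("10.0.0.1", "admin", "secret"):
        print(f"{src} -> {dst} ({src_if} -> {dst_if}): {pkts} packets")
```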

That’s the approach we took with SwitchMonitor’s NetFlow solution.  It gives you a number of advantages over the “NetFlow collector” type of solution:

  1. You get to see LIVE flows and instantly know who’s stealing the bandwidth and what they are doing
  2. You can monitor HUNDREDS of interfaces with our NetFlow solution (other “collectors” require gobs of disk space, CPU, and RAM to be able to monitor just 5 interfaces!)
  3. When we send a high-utilization alert email, we send the NetFlow information along with the alert.  This gives you the alert as well as WHO is taking the bandwidth and WHAT they are doing

We posted a video explaining how this solution can benefit your network on our website: www.pathsolutions.com/products/video.htm

We do recognize that some companies may still want historical NetFlow records.  That’s fine.  There are tons of historical NetFlow solutions on the market.

When you’re interested in seeing live flows, or in monitoring many interfaces, we’re the only game in town, because all of the other solutions seem to have missed answering the original question: “What’s slowing down the network RIGHT NOW?”

Why Traffic Reporters Don’t Report from Inside the Car

July 21, 2008

The next time you’re stuck in a traffic jam, consider how similar the freeway system is to your VoIP network.  You are data inside a packet (car), trying to get to and from work each day, in real time.

Since you’re inside the car, you have little or no ability to see further down the road to determine if you’re going to hit a roadblock or traffic jam.

A packet analyzer would be similar to a toll booth where all traffic passes through and the toll taker looks inside each car to identify the occupants.  The toll taker has the same limitation as you: They cannot see further down the road to help determine if you’re going to encounter a problem.

Some VoIP troubleshooting tools will send test cars onto the network to see how long it takes to get from one location to another.  Again, this doesn’t help locate where the traffic jams are, or what caused the slowdowns.

In order to locate problems on the freeway system, traffic reporters determined that they needed to get out of the car and get an eagle’s eye view of everything.  From an airplane, they can see all of the intersections, collisions, and traffic conditions everywhere.

Freeway agencies determined that they could not afford to keep traffic copters in the air 7x24x365, so they developed systems that gave them the information continuously at a low operational cost: Traffic loops and freeway cameras.

The traffic loops and cameras were deployed throughout the freeway system so that if there was a slowdown of traffic, they could quickly use a camera to see what the problem was and remedy it.

On a VoIP network, if you want to use a packet analyzer (toll booth) to see the condition of the network, it’s not going to provide you with enough information to solve problems (but it will be easy to determine how many cars have velvet interiors!).

If you intend to use simulated traffic (test cars), you’ll also be stuck with a lack of information as to where and why the VoIP network is not performing correctly.

Almost all networks have built-in traffic loops and cameras on every interface.  Managed switches and routers have been deployed far and wide, but little has been done to collect and analyze the information they gather, due to the complexity of SNMP.

PathSolutions’ SwitchMonitor breaks through this complexity by providing you visibility into the performance of every network interface.  Sources of traffic jams and packet loss can be quickly and easily remedied because the right information is available to you whenever it’s required.

Get out of the car, and into a vehicle that is capable of seeing the big picture of your network: SwitchMonitor VoIP.

The Network is Down? How did that happen?

May 21, 2008

You’ve designed a multi-path, fault-tolerant network.  You think you’re protected, right?

Not necessarily.

I will sheepishly admit that I got caught many years ago (before PathSolutions), when I designed a highly reliable network and thought that it was bulletproof.

I ran a network with three buildings in a campus, each one on its own subnet with links between each building.  I figured if one link went down, the traffic would be carried by the alternate link until we could diagnose and fix the outage.

I set up the monitoring software to ping various devices to make sure that they responded, and if a response failed, it would alert someone.

Things went well for a few months.

One morning, building 1 lost connectivity to the other buildings — a complete outage.  I looked at both of building 1’s links and discovered that one link had lost its connectivity just a few minutes earlier, and the alternate link had lost its connectivity over two weeks earlier.

How could this happen?  I thought I had monitoring in place to make sure this would never happen.

Using PING (ICMP echo) to test network connectivity is fine when you only have one path to a destination.  If that path is lost, you’ll get an alert.

If you have multiple routes to reach a destination, the PINGs will still go through even if the primary (or backup) route is down.  Thus, you’ll only get the notification if BOTH links go down.

This is the point where I learned that you need a better way to track this: Monitor the actual interface status and alert based on its status change.
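
For illustration only, here’s a minimal sketch of that approach (not the tool I used at the time), assuming SNMP v2c access and the pysnmp library.  The host, community string, and interface indexes are placeholders, and the print() call stands in for whatever alerting you actually use:

```python
# Minimal sketch: poll IF-MIB ifOperStatus for a set of redundant links and
# flag any status change.  Host, community string, and interface indexes are
# placeholders; replace print() with a real notification.
import time
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

IF_OPER_STATUS = "1.3.6.1.2.1.2.2.1.8"   # IF-MIB::ifOperStatus (1=up, 2=down)

def oper_status(host, community, if_index):
    err_ind, err_stat, _, var_binds = next(getCmd(
        SnmpEngine(), CommunityData(community),
        UdpTransportTarget((host, 161)), ContextData(),
        ObjectType(ObjectIdentity(f"{IF_OPER_STATUS}.{if_index}"))))
    if err_ind or err_stat:
        raise RuntimeError(str(err_ind or err_stat))
    return int(var_binds[0][1])

def watch(host, community, if_indexes, interval=60):
    last = {}
    while True:
        for idx in if_indexes:
            status = oper_status(host, community, idx)
            if idx in last and status != last[idx]:
                print(f"ALERT: {host} ifIndex {idx} changed {last[idx]} -> {status}")
            last[idx] = status
        time.sleep(interval)

# watch("10.1.1.1", "public", [1, 2])   # e.g. both of building 1's uplinks
```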

Embarrassment avoided!

Why is the Network so SLOW?

May 15, 2008

This is the age-old complaint of users everywhere.

Network administrators struggle to answer the question because they typically don’t have the proper tools deployed.  It’s also one of the biggest impacts on business productivity, yet the problem typically goes unsolved.

The causes of slowdowns come from the following areas:

  • Over-subscribed network link
  • High levels of packet loss on a link
    • Misconfiguration (duplex mismatch, collisions, incorrect QoS)
    • Poor cabling (RFI noise, CAT3 cabling used where CAT5e is required)
    • Hardware fault (damaged/broken interface)
    • High utilization levels can also lead to packet loss
  • Slow responding server

When a slowdown occurs, most network engineers immediately look at their WAN links to see if they are over-utilized.  They may employ a sniffer or packet analyzer to see if there is a flood of packets from any specific location to any other location, trying to make some sense of a huge pile of packets.

This attempt typically results in “guessing” at the problem, or worse yet, recommending upgrades where the facts have not been obtained.

The only way to accurately determine why the network is slow is to continuously monitor all of the links in the network and determine which ones are over-utilized and/or discarding packets.  That eliminates the guesswork, and provides the needed facts about where the problem is and why it’s occurring.
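
To make that concrete, here’s a hedged sketch (not the product’s implementation) that samples standard IF-MIB counters twice and computes utilization and error/discard deltas for a single interface.  It assumes SNMP v2c and the pysnmp library; the host, community string, and ifIndex are placeholders, and a real poller would use the 64-bit ifHC counters and handle counter wrap:

```python
# Hedged sketch: sample IF-MIB counters twice and compute utilization and
# error/discard deltas for one interface.  Host, community, and ifIndex are
# placeholders; production code should use the 64-bit ifHC* counters and
# handle counter wrap.
import time
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

OIDS = {
    "ifSpeed":      "1.3.6.1.2.1.2.2.1.5",
    "ifInOctets":   "1.3.6.1.2.1.2.2.1.10",
    "ifInDiscards": "1.3.6.1.2.1.2.2.1.13",
    "ifInErrors":   "1.3.6.1.2.1.2.2.1.14",
}

def snmp_get(host, community, oid):
    err_ind, err_stat, _, var_binds = next(getCmd(
        SnmpEngine(), CommunityData(community),
        UdpTransportTarget((host, 161)), ContextData(),
        ObjectType(ObjectIdentity(oid))))
    if err_ind or err_stat:
        raise RuntimeError(str(err_ind or err_stat))
    return int(var_binds[0][1])

def link_health(host, community, if_index, interval=30):
    def sample():
        return {name: snmp_get(host, community, f"{oid}.{if_index}")
                for name, oid in OIDS.items()}

    first = sample()
    time.sleep(interval)
    second = sample()

    bits = (second["ifInOctets"] - first["ifInOctets"]) * 8
    utilization = 100.0 * bits / (interval * first["ifSpeed"]) if first["ifSpeed"] else 0.0
    errors = second["ifInErrors"] - first["ifInErrors"]
    discards = second["ifInDiscards"] - first["ifInDiscards"]
    return utilization, errors, discards

# util, errs, disc = link_health("10.1.1.1", "public", 3)
```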

Why do Duplex Mismatches Occur?

May 2, 2008

This is a really old story, but it needs telling.

10Base-T was originally designed to be a half-duplex medium, where everyone shared the 10 Mbps of bandwidth.  You would plug a number of computers into a hub and know that CSMA/CD (Carrier Sense Multiple Access with Collision Detection) would handle any resource contention.

Collisions were part of the design, and were an accepted norm.

In 1989, a company called Kalpana looked at this situation and figured that collisions could be dramatically reduced if the “hub” was able to act as an Ethernet bridge instead.  They created the first multi-port bridge and decided to name it something completely different: an Ethernet switch.

The name caught on, and many companies decided they wanted to reduce the number of collisions they were having on their networks.

Switches became very popular as a network upgrade because no configuration was required — just unplug the hub and drop in a switch and BAM!  Instant upgrade.

Along the way, the Ethernet card and switch manufacturers realized that something interesting could be done with the medium: Both sides could be configured to talk at the same time and collisions could be completely eliminated.  This “full duplex” mode of communication sped things up even further.

At that time, the IEEE (the standards body that maintains the Ethernet specifications) did not include support for this full-duplex mode, so it was not implemented consistently by the companies that wanted to offer the capability.

Duplex had to be manually set on adapters as well as switches.  It was tedious to make sure that all connections were set properly, so companies responded to this with an “auto-configure” of duplex for the adapters and switch interfaces.

Auto-detection of interface speed (10 Mbps, 100 Mbps, 1000 Mbps) was easy to accomplish because the interface could sense the frequency being transmitted from the other side.

Auto-detection of the duplex setting of the remote side was quite a bit more complicated.  The remote side would transmit a carrier signal superimposed on top of the 10 Mbps or 100 Mbps Ethernet signal and hope that the receiving side would be able to properly “read” the signal.

The problem with this is that many adapters saw the superimposed signal as “line noise” and would filter it out rather than attempt to interpret it.

Since there were no standards defined to guide companies, problems occurred.

The switch port would “auto-configure” for full-duplex, and the NIC would default to half-duplex after it failed to read the superimposed signal.

This created an ugly situation: One side would transmit blindly, thinking that it was safe to do so, and the other side would see massive collisions and CRC errors (because it was still attempting to respect the CSMA/CD ruleset).
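
If the switch side of a suspect link is SNMP-manageable, that signature can often be spotted with the standard EtherLike-MIB.  The sketch below is illustrative only: the host, community string, and ifIndex are placeholders, and you should verify that your hardware actually populates these objects:

```python
# Illustrative sketch: read a switch port's duplex setting and the two
# counters that usually betray a mismatch (FCS errors on the full-duplex
# side, late collisions on the half-duplex side) from the EtherLike-MIB.
# Host, community, and ifIndex are placeholders.
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

DOT3_FCS_ERRORS      = "1.3.6.1.2.1.10.7.2.1.3"    # dot3StatsFCSErrors
DOT3_LATE_COLLISIONS = "1.3.6.1.2.1.10.7.2.1.8"    # dot3StatsLateCollisions
DOT3_DUPLEX_STATUS   = "1.3.6.1.2.1.10.7.2.1.19"   # 1=unknown, 2=half, 3=full

def snmp_get(host, community, oid):
    err_ind, err_stat, _, var_binds = next(getCmd(
        SnmpEngine(), CommunityData(community),
        UdpTransportTarget((host, 161)), ContextData(),
        ObjectType(ObjectIdentity(oid))))
    if err_ind or err_stat:
        raise RuntimeError(str(err_ind or err_stat))
    return int(var_binds[0][1])

def duplex_report(host, community, if_index):
    code = snmp_get(host, community, f"{DOT3_DUPLEX_STATUS}.{if_index}")
    duplex = {1: "unknown", 2: "half", 3: "full"}.get(code, "unknown")
    fcs = snmp_get(host, community, f"{DOT3_FCS_ERRORS}.{if_index}")
    late = snmp_get(host, community, f"{DOT3_LATE_COLLISIONS}.{if_index}")
    suspect_mismatch = (duplex == "full" and fcs > 0) or (duplex == "half" and late > 0)
    return duplex, fcs, late, suspect_mismatch

# duplex_report("10.1.1.1", "public", 12)
```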

The industry has not yet solved the problem (its solution is to recommend upgrading to Gigabit switches everywhere — Gigabit is always full-duplex and doesn’t have these problems!).

Gigabit isn’t the answer in many cases, as there’s currently no need for Gigabit to extend to the desktop except in rare cases (CAD design stations, for example).

Another problem is that many offices are deploying VoIP phones that require Power over Ethernet (PoE).  Gigabit Ethernet requires all four pairs of wire to communicate.  This means that Gigabit and PoE are incompatible, because there are only four pairs of wire in standard cable systems (10/100 only requires two pairs, and PoE requires two pairs, so the two can co-exist in one cable).

Switch manufacturers have added support for PoE in Gigabit environments by adding “phantom power” to two pairs that also carry data signal.  This works well in resolving the problem of having PoE co-exist with Gigabit to the desktop, but these switches are currently quite pricey.

This means that 10/100 Ethernet is destined to remain in use for the foreseeable future, and duplex mismatches will also remain a problem in networks.

Latency, Jitter, and Loss – Oh My!

April 10, 2008

With apologies to L. Frank Baum

Dorothy was scared of the creatures that lived in her network forest.

“I’m trying to get to Emerald VoIP City, but am scared of the Latency, Jitter, and Loss that exist on the road,” Dorothy exclaimed to the Tin Woodsman IT consultant.  “How will we reach Emerald VoIP City with all of these problems?”

The Tin Woodsman was very familiar with this concern, as many before Dorothy had this same worry.  “The Latency, Jitter, and Loss that you fear are valid concerns, Dorothy,” the woodsman said.  “They can cause problems with reaching Emerald VoIP City, but they are only indicators, or symptoms, of the real problems.”

Dorothy looked perplexed by the response.  “You mean I shouldn’t fear Latency, Jitter, and Loss on the road?”

“No.  Latency, Jitter, and Loss are valuable indicators, but they cannot lead you to the solution.  You need to look beyond these symptoms for the true cause of problems,” the woodsman said with confidence.

Dorothy picked up her dog Toto and looked plaintively into the woodsman’s eyes.  “What should I do?”

“Simple,” the woodsman replied.  “Ask the road.”

Dorothy looked confused again.

“The road knows what is happening.  It knows all about the problems; you just have to ask it,” said the woodsman.

Dorothy looked around and noticed that there were yellow bricks lying all over the place, and potholes marred the previously beautiful road.  Just up the road there was a horrendous traffic jam of merchants with their carts trying to get through an intersection.

“I can see some of the problems right here!” Dorothy exclaimed.

The Tin Woodsman knew that Dorothy was still a bit short-sighted.  “What about the problems further down the road that you can’t see?  How will you uncover those problems?”

“Oh, I didn’t think of that.  I suppose I’ll try your method,” said Dorothy.

“Yellow brick road, what are your problems?” Dorothy asked sheepishly.

The road hadn’t been spoken to in ages, so it was surprised by the question.  “I’m glad someone finally asked me!  I have potholes from Munchkin City to the scarecrow farm.  There are too many carts on the road between the scarecrow farm and the woods.  There is only one-way traffic from the woods to the poppy fields.”

Dorothy almost dropped her dog.  “Wow!  I now know all of the real problems and can have them easily fixed!”

“I’ll have the road replaced from Munchkin City to the scarecrow farm.  That should fix the loss problem on that section of the road.”

“To fix the traffic problem on the road between the scarecrow farm and the woods, I’ll put an additional lane in place that should only be used for the high-priority carts.”

“To fix the one-way traffic problem from the woods to the poppy fields, I’ll change that section of road to be two-way.”

The Tin Woodsman felt good, knowing that Dorothy would soon have what she wanted: an easy path to Emerald VoIP City.

Dorothy smiled with satisfaction.  “I’ll never be scared of Latency, Jitter, and Loss again, now that I know how to uncover the true cause of VoIP problems.”

Network Instrumentation: Sniffing, Simulating, and Sampling

March 19, 2008

All networks have faults.  They’re part of running an operational network.

Knowing which faults will affect your business and when is key to being a network professional.  That’s why network instrumentation is important both from a business level and a technical level.

There are three different methods to instrument a network: Sniffing, Simulating, and Sampling.

Sniffing

Sniffing involves looking at individual packets as they pass through a network analyzer.  This type of instrumentation is valuable for seeing protocol problems or looking at specific fields inside specific packets.

If a computer cannot get a DHCP IP address, it may be beneficial to try sniffing the connection to determine what the problem is.  It will show the transmitted DHCP request, and you can then see if a response is received or not.
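
As a quick illustration, a capture for that DHCP case might look something like the following sketch, which assumes the scapy library, root privileges, and a placeholder interface name:

```python
# Rough sketch (requires scapy and root privileges): watch the wire for DHCP
# traffic so you can see whether a client's DISCOVER/REQUEST ever gets an
# OFFER/ACK back.  The interface name is a placeholder.
from scapy.all import sniff, BOOTP, DHCP

MSG_TYPES = {1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 4: "DECLINE",
             5: "ACK", 6: "NAK", 7: "RELEASE", 8: "INFORM"}

def show_dhcp(pkt):
    if not pkt.haslayer(DHCP):
        return
    msg = None
    for opt in pkt[DHCP].options:          # options is a list of tuples
        if isinstance(opt, tuple) and opt[0] == "message-type":
            msg = opt[1]
            break
    client_mac = pkt[BOOTP].chaddr[:6].hex(":")
    print(f"{MSG_TYPES.get(msg, msg)} client={client_mac} yiaddr={pkt[BOOTP].yiaddr}")

sniff(iface="eth0", filter="udp and (port 67 or port 68)",
      prn=show_dhcp, store=False)
```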

Pro: Deep packet level inspection.

Con: Only looks at packets passing through one interface at a time.

Usage: Typically employed ad-hoc.

Training: Must understand how network protocols interoperate.

Simulating

Simulating involves creating a simulation of the event you are trying to debug and watching operational characteristics of the simulation.

If you simulate an HTML transaction from the server’s console (i.e., no network involved) and it responds quickly, but the same HTML transaction responds very slowly from a remote network, that is proof that the network is causing the slowdown for the HTML page.  No matter what performance improvements are made on the server, they won’t help resolve the problem.
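
Here’s a minimal sketch of such a simulation: it times the same HTTP request, so it can be run once on the server’s console and once from a remote subnet and the results compared.  The URL is a placeholder:

```python
# Minimal sketch: time the same HTTP request from two vantage points and
# compare.  The URL is a placeholder.
import time
import urllib.request

def time_request(url, tries=5):
    samples = []
    for _ in range(tries):
        start = time.perf_counter()
        with urllib.request.urlopen(url, timeout=30) as resp:
            resp.read()                     # pull the whole page, not just the headers
        samples.append(time.perf_counter() - start)
    return min(samples), sum(samples) / len(samples)

best, average = time_request("http://intranet.example.com/report.html")
print(f"best {best:.3f}s  average {average:.3f}s")
```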

Pro: Typically quick and easy to deploy.

Con: Limited ability to see WHERE the problem lies, or WHAT is causing it to occur.

Usage: Typically employed ad-hoc.

Training: Little or none required.

Sampling

Sampling involves querying the network for performance characteristics on a regular basis.  This can allow for correlation between a simulation and the root cause of the issue.

In the earlier example of the slow-loading HTML page, once it has been determined that the network is causing the slowdown, you still need to determine where in the network the slowdown is coming from and what is causing it.

If all of the network links are sampled for performance information on a regular basis, it will be easy to look at the performance of each link utilized in the transaction to determine if errors or over-utilization caused the performance problem.
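
The sampling pattern itself is simple.  Here’s a tiny illustration, where fetch_counters is a placeholder you would back with SNMP queries like the IF-MIB sketch shown earlier:

```python
# Tiny illustration of the sampling pattern: poll each link's error/discard
# counters on an interval and report any increase.  fetch_counters is a
# placeholder callable that returns {"errors": n, "discards": n} for a link,
# typically backed by SNMP.
import time
from typing import Callable, Dict

def sample_links(fetch_counters: Callable[[str], Dict[str, int]],
                 links, interval=60, cycles=10):
    previous = {link: fetch_counters(link) for link in links}
    for _ in range(cycles):
        time.sleep(interval)
        for link in links:
            current = fetch_counters(link)
            increased = {name: current[name] - previous[link][name]
                         for name in current
                         if current[name] > previous[link][name]}
            if increased:
                print(f"{link}: counters increased {increased}")
            previous[link] = current
```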

Pro: Can determine the exact location of problems and specific cause.

Con: Typically requires a great deal of deployment effort.

Usage: Continuously monitors network conditions.

Training: Requires SNMP knowledge, as well as knowledge of specific device MIBs and OIDs.

These three troubleshooting methodologies disclose different types of information about a network’s operation.

Make sure you choose the right tool to solve the problem or you may be stuck looking at the problem from the wrong angle and not be able to get resolution in any reasonable timeframe. 

“There’s gold in them thar hills!”

February 21, 2008

As I drive down Highway 101 through Silicon Valley, I look at the buildings around me and think about the switches and routers running each company’s network.  Each one is continuously collecting statistics about its operation, yet nobody who works for these companies has any knowledge of these health indicators until something crashes.

Most network switches and routers support SNMP, but very few companies are able to gain any benefit from this capability.

There are three main reasons for this:

  1. Training required to understand how SNMP works
  2. Research required to determine which specific OIDs should be monitored
  3. The massive number of bugs that exist in device SNMP implementations

Many network engineers never achieve an understanding of SNMP due to its complexity.  They are tasked with a multitude of business-driven projects, but are rarely able to focus on improving their network’s management infrastructure so that their network runs more smoothly.

Once SNMP as a technology is understood, you would need to determine which MIB files a specific device running a specific OS version supports.  Then, the OIDs need to be determined from the ASN.1-formatted MIB files.  Even with a MIB browser, it can still be difficult to determine which variable is going to provide the information you want.  (Gaaaacckkk!!!  Too many three-letter acronyms!)
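
A library can at least take the OID-translation pain away.  As a hedged sketch, pysnmp ships with compiled copies of the standard MIBs, so you can walk a table by symbolic name instead of hand-translating ASN.1 (the host and community string are placeholders):

```python
# Hedged sketch: walk ifDescr and ifInErrors by symbolic name.  pysnmp
# resolves the names using its bundled copy of IF-MIB, so no hand-translation
# of ASN.1 is needed.  Host and community string are placeholders.
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, nextCmd)

def walk_interface_errors(host, community):
    iterator = nextCmd(
        SnmpEngine(), CommunityData(community),
        UdpTransportTarget((host, 161)), ContextData(),
        ObjectType(ObjectIdentity("IF-MIB", "ifDescr")),
        ObjectType(ObjectIdentity("IF-MIB", "ifInErrors")),
        lexicographicMode=False)            # stop at the end of the table
    for err_ind, err_stat, _, var_binds in iterator:
        if err_ind or err_stat:
            raise RuntimeError(str(err_ind or err_stat))
        descr, in_errors = var_binds[0][1], var_binds[1][1]
        print(f"{descr.prettyPrint():30} inErrors={in_errors}")

# walk_interface_errors("10.1.1.1", "public")
```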

Now that you know which variables to query, you run straight into the bugs that exist in SNMP implementations.  Various equipment manufacturers have taken liberties with the SNMP standards to their benefit.  Sometimes it’s as simple as boastful marketing, like “Yes, we do currently support the BRIDGE-MIB, but it’s going to be released in a future release.”  (i.e., they don’t currently support it, but they’re saying they do.)

Sometimes there are bugs that at best make your job tougher, and at worst crash the box.

Occasionally there are manufacturers who decide that they don’t want to fully support a MIB the way it’s defined, and they change the rules to suit their own needs.

With all of these difficulties, it’s no wonder that so many companies have difficulty getting valid information about their networks.