
The Jitter Bug


This is an "in-the-trenches" story about how our team diagnosed a strange VoIP issue.

One of our partners used our services to provide a line for connectivity and VoIP traffic. Unfortunately, the end user reported that, while using the service, their phones suffered frequent audio break-ups in which the line would go silent. The supplier providing the SIP service insisted that our service was at fault. While it was not our responsibility, we decided to send an engineer on-site to figure out what was happening.

The Client Setup

The client's PBX connected to a SIP provider outside our network, and as we had no access to the PBX, we asked for access to the router to help diagnose the issue. We usually shy away from taking control of customer devices, as we have had cases where customers blamed us for unrelated issues even though we never changed any configuration settings. However, we had to isolate the fault and therefore needed access to the device.

Rack View

The client used a digital PBX from NEC (highlighted in red in the photo above) with a SIP trunk running over their internet connection, a leased line circuit, to reach the PSTN. There was no other PSTN breakout (i.e. no legacy ISDN2/ISDN30).

The incoming Exa Networks leased line terminated on an Openreach ADVA NTE (at the top of the photo, highlighted in green). The connectivity then terminated on an Exa-managed Ubiquiti router (also highlighted in green), which provided a /28 (16 IPv4 addresses) to the customer.

A small eight-port Netgear switch sat on top of the Ubiquiti router, distributing the Internet feed to two DrayTek firewalls:

  • The upper, a DrayTek Vigor 2860, supplied Internet access to the school LAN (the “internet firewall”).
  • The lower, a DrayTek Vigor 2862n, supplied Internet access to the PBX (the “VoIP firewall”). The PBX was the only device plugged into the LAN port on this firewall.

The client LAN and the PBX operated as separate internal private networks, and there was no direct link between these networks. Any traffic on the LAN (e.g. broadcast/multicast traffic or loops/storms) would consequently not impact the PBX operation.

The fact that the customer was using DrayTeks assisted our investigation, since we use DrayTek equipment extensively and our team are very familiar with the technology.

Before tests began, the physical topology was as follows:

Initial Topology

Our Engineer's Toolbox

The testing hardware brought to the site by Exa:

  • Dell laptop running Windows 10
  • D-Link DGS-1510-20 managed gigabit switch
  • A number of lengths of Cat5e/Cat6 cable

Visiting Wonderland

Our initial test was to check basic connectivity to the router. This was done by connecting the laptop directly to the VoIP firewall's LAN. The engineer's computer obtained a DHCP lease of 10.12.218.200.

First, a configuration backup of the Vigor was taken as a safety precaution. The configuration was then inspected, but nothing unusual stood out regarding the interface settings or bandwidth management/QoS.

Then, the Windows machine performed a packet capture using Wireshark (a popular and powerful open-source network diagnostic tool).

The PBX was not disconnected from the VoIP firewall and remained in operation as usual. The laptop and the PBX were thus the only devices connected to the LAN ports of the VoIP firewall.

Initial Test Topology

Within less than a minute, a continuous ping (using the standard Windows command-line tool "ping" with the -t option) to the firewall's LAN IP, 10.12.218.1, showed many responses in the region of 300ms, indicating a possible problem with either the device itself or the LAN side of the firewall.

A screenshot of the ping results showing high latency is as follows:

Initial Vigor ping

Continuous pings were also made to the Exa Ubiquiti Internet router. Similar results were seen for this ping traffic passing from the laptop through the VoIP firewall to the outside public subnet and back again. "PsPing", from the Microsoft Sysinternals tool suite, was also used to provide more granular ping tests (i.e. pinging more frequently than once per second).

Initial Windows ping
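For reference, the tests were along these lines (the LAN IP is as observed on site; the PsPing options shown are just one sensible choice):

    rem Continuous ping to the VoIP firewall's LAN address (Ctrl+C to stop)
    ping -t 10.12.218.1

    rem PsPing from Sysinternals: -i 0 sends requests back-to-back, -n 1000 limits the run
    psping -i 0 -n 1000 10.12.218.1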

The following screenshot from Wireshark shows the captured response time of 311ms to a ping from the laptop to the VoIP firewall's internal (LAN) address.

Initial Wireshark Bad Packet Capture

Similarly, Wireshark allowed us to observe a response time of 170ms through the VoIP firewall to the Ubiquiti Internet gateway, which relates to the previous ping screenshot:

Initial Wireshark Good Packet Capture

Pings between the test laptop and the PBX showed no internal LAN connectivity issues across the switch ports of the Vigor. The initial indication was that traffic passing through, or destined for, the CPU of the VoIP firewall was being impacted by an as-yet-unidentified cause.

Changing the Topology

The D-Link test switch was mounted in the rack and powered up. Next, the D-Link was introduced in-path, passing traffic between the VoIP firewall and the PBX and mirroring that traffic in both directions to the test laptop.

Second Test Topology

An inbound call was then established and the traffic was captured by Wireshark.

Wireshark Stream Analysis Call
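For what follows, assume the mirrored traffic was saved to a file for analysis. The capture was taken with Wireshark on the laptop; an equivalent command-line capture would look something like this (the interface name and the file name are placeholders):

    rem Capture everything arriving on the mirrored port to a file for later analysis
    tshark -i "Ethernet" -w lan-call.pcap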

This call had audible artefacts. RTP sequence number analysis of each direction of the call at this capture point indicated no packet loss. RTP received from the PBX and destined for the SIP provider (i.e. captured before it reached the VoIP firewall's LAN port) showed a clean 20ms pacing of traffic, which is to be expected of a LAN capture (i.e. no apparent issues with packetisation from the PBX outwards towards the PSTN).

Wireshark Stream Analysis Graph

From the same capture, the inbound (PSTN to PBX) traffic analysis indicated no packet loss but there was one period of time skew in the region of 294ms at packet number 15834:

Wireshark Stream Analysis Data

Wireshark Stream Analysis Graph

The Wireshark capture was inspected at packet 15834, but no other traffic was present to indicate a LAN issue (e.g. a broadcast or other “bad” packets):

Wireshark Stream Analysis Packets
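As an aside, the same per-stream figures that Wireshark's RTP stream analysis produces (packet counts, loss, maximum delta and jitter) can also be pulled from the saved capture on the command line, assuming the SIP signalling is present in the capture so the RTP streams are recognised. The file name and the PBX IP below are placeholders:

    rem Per-stream RTP statistics: packets, loss, max delta, max and mean jitter
    tshark -r lan-call.pcap -q -z rtp,streams

    rem Per-packet arrival times and sequence numbers for RTP sent by the PBX
    tshark -r lan-call.pcap -Y "rtp && ip.src == 10.12.218.10" -T fields -e frame.time_relative -e rtp.seq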

At this stage, we had ascertained the following:

  • Outbound RTP audio from the PBX towards the VoIP firewall (and thus the Internet SIP provider) was clean when leaving the PBX, before it hit the VoIP firewall.
  • RTP audio passed from the VoIP firewall to the PBX (thus from the Internet SIP provider) showed a clear issue, with transmission timing skew of a similar magnitude to the random ping latency observed in the initial laptop ping tests towards the VoIP firewall.

Down the Rabbit Hole

To identify whether the VoIP firewall was at fault, the D-Link capture switch needed to be placed one device up the chain, between the Netgear distribution switch and the VoIP firewall's outside (WAN) interface.

Third Test Topology

The testing resumed, and finally, during a nine-minute test call, some session artefacts became audible.

Wireshark Stream Analysis Call

RTP analysis of this last session revealed a similar latency skew on RTP traffic originating from the VoIP firewall's WAN interface (i.e. traffic transmitted by the PBX towards the Internet SIP provider and forwarded through the firewall).

In the following example of outbound RTP, there is zero packet loss but a significant number of skews of approximately 300ms.

Wireshark Stream Analysis Graph

For the inbound RTP of the same call (i.e. traffic from the Internet SIP provider through the leased line and Netgear switch), only minimal skew occurred (the maximum was 43ms at packet 21315):

Wireshark Stream Analysis Graph

Thus, it appears that RTP passing in either direction through the VoIP firewall (i.e. port to port) was subject to periods of increased transit/transmission latency, of approximately 300ms, at seemingly random times. A 300ms stall holds back roughly fifteen consecutive 20ms RTP frames, which is enough to exhaust the de-jitter buffers at either side (PBX and SIP provider), resulting in degraded call quality.

Re-visiting the outbound RTP direction analysis and cross-referencing it with the main PCAP log, the first occurrence of this skew is at packet 1237 in the capture:

Wireshark Stream Analysis Data

Highlighting this entry and selecting “go to packet” from the right-click menu takes you to this packet in the main capture window. Immediately before this event, at packet 1236, an SSH key exchange to the public IP of the firewall is recorded.

Wireshark Stream Analysis Packets

Looking at the next skew occurrence, we find packet 1845:

Wireshark Stream Analysis Data

Re-visiting the main capture window for packet 1845:

Wireshark Stream Analysis Packets

There is another key exchange at packet 1844. Again, looking at the next RTP skew event at packet 2508:

Wireshark Stream Analysis Data

Within the main capture, a third key exchange occurs just before it, at packet 2507:

Wireshark Stream Analysis Packets

Applying a filter to display only Ethernet traffic originating from the VoIP firewall (i.e. frames sourced from its public WAN interface MAC address), a clear ~300ms gap/stall in ALL packet transmission from the firewall towards the Internet is observable whenever these SSH key exchanges occur:

Wireshark Stream Analysis Packets

Between the packets highlighted in red above, there is a 301.474ms pause (timestamp 94.384337s minus 94.082863s) in packet transmission/forwarding from the device.
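In Wireshark this view is presumably just a display filter on the source MAC (eth.src). An equivalent command-line view of the WAN-side capture would be something like the following, where the capture file name and the MAC address are placeholders:

    rem Show only frames whose Ethernet source is the firewall's WAN MAC
    tshark -r wan-call.pcap -Y "eth.src == aa:bb:cc:dd:ee:ff"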

Re-visiting the first mirror capture session (taken between the PBX and the internal LAN port of the VoIP firewall), the same gap is present around packet 15834:

Wireshark Stream Analysis Packets

Here, the stall was approximately 278ms, which is the difference in time between the two packets originating from the MAC of the LAN side of the firewall.

The Eureka Moment

Searching through the packet capture for SSH key exchanges reveals a significant number of connections:

Wireshark Stream Analysis Packets
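One quick way to enumerate those sessions from the capture on the command line, assuming SSH on the default port 22 (again, the file name is a placeholder):

    rem List the SSH traffic in the capture; each new session performs a key exchange shortly after the TCP handshake
    tshark -r wan-call.pcap -Y "ssh"

    rem Or list just the connection attempts to the firewall's SSH port
    tshark -r wan-call.pcap -Y "tcp.dstport == 22 && tcp.flags.syn == 1 && tcp.flags.ack == 0" -T fields -e ip.src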

Inspecting the configuration of the Vigor, we found that remote management for SSH, HTTPS and Telnet was open to the world without any source IP restrictions.

Initial Vigor Settings

Upon this finding, the SSH service was disabled, and consequently, the VoIP issue disappeared.

Following up

With this problem identified, we decided to investigate how to protect other customers from this issue. First, we checked that the behaviour was reproducible and observable remotely. Using a test router, we could see clear latency spikes of around 300ms when SSH sessions were initiated to other Vigor routers.

The Vigor router's implementation of the cryptographic code required for SSH could cause an interruption (stall) of approximately 300ms in packet processing/forwarding during the initiation of an SSH session to the device.
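A rough way to reproduce this remotely, assuming a spare Vigor whose SSH management is reachable: keep a fast ping running against the router while repeatedly triggering SSH key exchanges (ssh-keyscan, which ships with the OpenSSH client tools, performs the key exchange and then disconnects), and watch for latency spikes. The address below is a placeholder.

    rem Window 1: continuous ping against the test router
    ping -t 192.0.2.1

    rem Window 2 (typed at a cmd prompt): trigger a fresh SSH key exchange roughly every five seconds
    for /l %i in (1,1,20) do (ssh-keyscan 192.0.2.1 & timeout /t 5)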

Conclusion

While it took some on-site investigation to find the root cause of the issue, we’re pleased we were able to help the customer - and glad to see that our service wasn’t at fault!

No vendor OS is perfect, and while it would be easy to look negatively at the Vigor routers, they are excellent CPE and work very well - so well that many customers are under the impression that they can "install and forget" them.

This story should also remind everyone managing CPE on behalf of end users that they should keep an eye on CVEs (disclosed vulnerabilities). Older versions of the Vigor OS are affected by a remote code execution vulnerability, which once again demonstrates why management ports should never be left open to the world.

Exa takes the security of its infrastructure seriously. We apply tight access control lists to the management ports and perform regular security scans of our infrastructure for known vulnerabilities.

Exa Networks does not fall within the scope of the new Telecommunications Security Act, but as an ISO 27001 (and Cyber Essentials) certified business, we already follow many of the requirements set out by the legislation. While the law places new requirements on ISPs to secure their infrastructure, it does not apply to customer devices. We can only invite our customers to update and secure their devices; without the right to manage the equipment directly, we cannot ensure that this happens.

Customers using our router management product, Router for Life, have their routers centrally configured and safeguarded by an ACS (Auto Configuration Server), automatically keeping their equipment up to date and well secured.