The Jitter Bug
This is an "in-the-trenches" story about how our team diagnosed a strange VOIP issue.
One of our partners used our services to provide a line for connectivity and VoIP traffic. Unfortunately, the end user reported that their phones suffered frequent audio break-ups, with the line going briefly silent during calls. The supplier providing the SIP service refused to accept that our connectivity was not at fault. While it was not our responsibility, we decided to send an engineer on-site to figure out what was happening.
The Client Setup
The client's PBX connected to a SIP provider outside our network, and as we had no access to the PBX, we asked for access to the router to help diagnose the issue. We usually shy away from taking control of customer devices, as we have had cases where customers blamed us for unrelated issues even though we never changed any configuration settings. However, we had to isolate the fault and therefore needed access to the device.
The client used a digital PBX from NEC (highlighted in red in the above photo) with a SIP trunk through their internet connection, a leased line circuit, to connect with the PSTN network without any other PSTN breakout (i.e. no legacy ISDN2/ISDN30).
The incoming Exa Networks leased line terminated on an Openreach ADVA NTE (at the top of the photo, highlighted in green). The connectivity terminated on an Exa-managed Ubiquiti router (highlighted in green), which provided a /28 (16 IPv4 addresses) for the customer.
A small eight-port Netgear switch sat on top of the Ubiquiti router, distributing the Internet feed to two DrayTek firewalls:
- The upper, a DrayTek Vigor 2860, supplied Internet access to the school LAN (the “internet firewall”).
- The lower, a DrayTek Vigor 2862n, supplied internet access to the PBX (the “VoIP firewall”). The PBX was the only device plugged into the LAN port on this firewall.
The client LAN and the PBX operated as separate internal private networks, and there was no direct link between these networks. Any traffic on the LAN (e.g. broadcast/multicast traffic or loops/storms) would consequently not impact the PBX operation.
The fact that the customer was using DrayTek routers assisted our investigation, since we use DrayTek equipment extensively and our team are very familiar with the technology.
Before tests began, the physical topology was as follows:
Our Engineer's Toolbox
The testing hardware brought to the site by Exa:
- Dell laptop running Windows 10
- D-Link DGS-1510-20 managed gigabit switch
- A number of lengths of Cat5e/Cat6 cable
Visiting Wonderland
Our initial test was to check basic connectivity towards the router. This was done by connecting the laptop directly to the VoIP firewall LAN; the engineer's computer obtained a DHCP lease of 10.12.218.200.
First, a router configuration backup was taken for the Vigor as a safety precaution. Its configuration was then inspected, but nothing special stood out regarding the device interface configuration or bandwidth management/QoS.
Then, the Windows machine performed a packet capture using Wireshark (a popular and powerful open-source network diagnostic tool).
The PBX was not disconnected from the VoIP firewall and remained in operation as usual; the laptop and the PBX were therefore the only devices connected to the LAN ports on the VoIP firewall.
Within less than a minute, a continuous ping (using the standard Windows command-line tool "ping" with the -t option) to the router LAN IP 10.12.218.1 showed many responses in the region of 300ms, indicating a possible problem with either the router itself or the LAN side of the firewall.
A screenshot of the ping results showing high latency is as follows:
Continuous pings were also made to the Exa Ubiquiti Internet router. Similar results were seen for this ping traffic, which passed from the laptop through the VoIP firewall to the outside public network subnet and back again. "PsPing", from the Microsoft Sysinternals tool suite, was also used to provide more granular ping tests (i.e. pinging more frequently than once per second).
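For anyone wanting to reproduce this kind of test, the sketch below shows the general idea: ping the firewall's LAN IP several times a second and flag anything unusually slow. It is only an illustration, assuming the third-party ping3 library rather than the Windows tools we used on-site, and the threshold and interval values are arbitrary.

```python
# A minimal sketch of the repeated ping test, using the third-party "ping3"
# library (pip install ping3) rather than the Windows ping/PsPing tools used
# on-site. Raw ICMP sockets typically require administrator privileges.
import time
from ping3 import ping

TARGET = "10.12.218.1"   # VoIP firewall LAN IP from the tests above
THRESHOLD_MS = 100       # anything above this is suspicious on a LAN
INTERVAL_S = 0.2         # five pings per second, more granular than "ping -t"

for _ in range(300):     # roughly one minute of testing
    rtt = ping(TARGET, timeout=1, unit="ms")
    if rtt is None:
        print("timeout")
    elif rtt > THRESHOLD_MS:
        print(f"latency spike: {rtt:.1f} ms")
    time.sleep(INTERVAL_S)
```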
The following screenshot from Wireshark shows the captured response time of 311ms to a ping from the laptop to the internal (LAN) interface of the VoIP firewall.
Similarly, Wireshark allowed us to observe a response time of 170ms through the VoIP firewall to the Ubiquiti Internet gateway, which relates to the previous ping screenshot:
Pings between the test laptop and the PBX showed no internal LAN connectivity issues using the switch ports on the Vigor. The initial indication was that traffic passing through, or destined to, the CPU of the VoIP firewall was being impacted by an as-yet-unidentified cause.
Changing the Topology
The D-Link test switch was mounted in the rack and powered up. It was then introduced in-path between the VoIP firewall and the PBX, passing traffic between the two and mirroring it in both directions to the test laptop.
An inbound call was then established and the traffic was captured by Wireshark.
This call had audible artefacts. RTP sequence-number analysis of each direction of the call at this capture point indicated no packet loss. RTP received from the PBX and destined for the SIP provider (i.e. before reaching the VoIP firewall LAN port) showed a clean 20ms pacing of traffic, which is to be expected of a LAN capture (i.e. no apparent issues with the packetisation from the PBX outwards towards the PSTN).
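The same pacing and sequence-number checks can be scripted rather than read off Wireshark's RTP analysis screens. The sketch below is only an illustration of the idea, not the tooling we used on-site; the capture file name is a placeholder and pyshark is an assumption.

```python
# A rough sketch of the RTP pacing and sequence-number check, using pyshark
# (a Python wrapper around tshark; pip install pyshark). The capture file
# name is a placeholder, and it assumes tshark already decodes the stream
# as RTP (e.g. the SIP signalling is in the capture, or a "decode as" rule
# has been applied).
import pyshark

last_seq, last_ts = {}, {}   # keyed by SSRC, i.e. one entry per RTP stream

cap = pyshark.FileCapture("mirror_capture.pcapng", display_filter="rtp")
for pkt in cap:
    ssrc = pkt.rtp.ssrc
    seq = int(pkt.rtp.seq)
    ts = float(pkt.sniff_timestamp)
    if ssrc in last_seq:
        # a jump in sequence numbers would indicate packet loss
        if seq != (last_seq[ssrc] + 1) % 65536:
            print(f"SSRC {ssrc}: sequence jump {last_seq[ssrc]} -> {seq}")
        # with 20ms packetisation, the inter-packet gap should stay near 20ms
        gap_ms = (ts - last_ts[ssrc]) * 1000
        if gap_ms > 40:
            print(f"SSRC {ssrc}: {gap_ms:.1f} ms gap before seq {seq}")
    last_seq[ssrc], last_ts[ssrc] = seq, ts
cap.close()
```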
From the same capture, the inbound (PSTN to PBX) traffic analysis indicated no packet loss but there was one period of time skew in the region of 294ms at packet number 15834:
The Wireshark capture was inspected at packet 15834, but no other traffic was present to indicate a LAN issue (e.g. a broadcast or other “bad” packets):
At this stage, we had ascertained the following:
- Outbound RTP audio from the PBX towards the VoIP firewall (and thus the Internet SIP provider) was clean when leaving the PBX, before it hit the VoIP firewall.
- RTP audio passed from the VoIP firewall to the PBX (thus from the Internet SIP provider) showed a clear issue, with a similar amount of transmission timing skew to the random ping latency observed in the initial laptop ping tests towards the VoIP firewall.
Down the Rabbit Hole
To identify whether the VoIP firewall was at fault, the D-Link capture switch needed to be moved one device up the chain, between the Netgear distribution switch and the VoIP firewall's outside (WAN) interface.
The testing resumed, and finally, during a nine-minute test call, some session artefacts became audible.
RTP analysis of this last session revealed a similar latency skew on RTP traffic originating from the VoIP firewall (i.e. transmitted by the PBX towards the Internet SIP provider through the firewall).
In the following example of outbound RTP, there is zero packet loss but a significant number of skews of approximately 300ms.
For the inbound RTP of the same call (i.e. traffic from the Internet SIP provider through the leased line and Netgear switch), only minimal skew occurred (the maximum was 43ms at packet 21315):
Thus, it appears that RTP passing in either direction through the VoIP firewall (i.e. port to port) was subject to periods of increased transit/transmission latency, of approximately 300ms, at seemingly random times. 300ms is enough to exhaust de-jitter buffers at either side (PBX and SIP provider), resulting in degraded call quality.
Re-visiting the outbound RTP direction analysis and cross-referencing it with the main PCAP log, the first occurrence of this skew is at packet 1237 in the capture:
Highlighting this and right-clicking “go to packet” takes you to this packet in the main capture window. Near this event, an SSH key exchange to the public IP of the firewall in packet 1236 is recorded.
Looking at the next skew occurrence, we find packet 1845:
Re-visiting the main capture window for packet 1845:
There is another key exchange at packet 1844. Again, looking at the next RTP skew event at packet 2508:
Within the main capture, a third key exchange occurs just before at packet 2507:
Applying a filter to display only Ethernet traffic originating from the VoIP firewall (the public WAN interface MAC address), a clear 300ms gap/stall in ALL packet transmission from the firewall towards the Internet is observable whenever these SSH key exchanges occur:
Between the packets highlighted in red above, there is a 301.474ms pause (timestamp 94.384337s minus 94.082863s) in packet transmission/forwarding from the device.
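The same gap analysis can also be scripted rather than done by eye. The sketch below is illustrative only: it keeps frames sourced from the firewall's WAN MAC, measures the time between consecutive frames, and reports where any long stall ends so the packets just before it can be inspected in Wireshark. The MAC address and file name are placeholders, not the real values from this investigation.

```python
# A sketch of the gap analysis done programmatically with pyshark: keep only
# frames whose source MAC is the firewall's WAN interface, measure the time
# between consecutive frames, and report where any pause longer than 250ms
# ends. The MAC address and file name below are placeholders.
import pyshark

FIREWALL_MAC = "00:1d:aa:00:00:01"   # placeholder for the Vigor WAN MAC
cap = pyshark.FileCapture(
    "wan_mirror.pcapng",
    display_filter=f"eth.src == {FIREWALL_MAC}",
)

prev_time = None
for pkt in cap:
    t = float(pkt.sniff_timestamp)
    if prev_time is not None and (t - prev_time) > 0.250:
        print(f"{(t - prev_time) * 1000:.1f} ms stall ends at frame {pkt.number}; "
              "inspect the packets captured just before it (e.g. SSH key exchanges)")
    prev_time = t
cap.close()
```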
Re-visiting the first mirror capture session (taken between the PBX and the internal LAN port of the VoIP firewall), the same gap is present around packet 15834:
Here, the stall was approximately 278ms, which is the difference in time between the two packets originating from the MAC of the LAN side of the firewall.
The Eureka Moment
Searching through the packet capture for SSH key exchanges reveals a significant number of connections:
Inspecting the configuration of the Vigor, we find that remote management for SSH, HTTPS and telnet was open to the world without any source IP restrictions.
Following this finding, the SSH service was disabled, and consequently, the VoIP issue disappeared.
Following up
With the problem identified, we decided to investigate how to protect other customers from the same issue. First, we checked that this behaviour was reproducible and observable remotely. Using a test setup, we could see clear latency spikes of around 300ms when SSH sessions were initiated to other Vigor routers.
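As a rough illustration of that remote reproduction test, the sketch below pings a router in a background thread while initiating an SSH key exchange with the paramiko library. The router address is a placeholder, and ping3/paramiko are assumptions rather than the exact tools we used.

```python
# A rough sketch of the remote reproduction test: ping the test router in a
# background thread while initiating an SSH key exchange with paramiko
# (pip install paramiko ping3). No login is needed; the key exchange happens
# before authentication. The router address is a placeholder.
import threading
import time

import paramiko
from ping3 import ping

ROUTER = "192.0.2.1"                  # placeholder test-router address
results, stop = [], threading.Event()

def pinger():
    while not stop.is_set():
        results.append(ping(ROUTER, timeout=1, unit="ms"))
        time.sleep(0.1)

t = threading.Thread(target=pinger)
t.start()
time.sleep(2)                                  # gather a baseline first

transport = paramiko.Transport((ROUTER, 22))   # TCP connection to the SSH port
transport.start_client()                       # performs the SSH key exchange
transport.close()

time.sleep(2)
stop.set()
t.join()

spikes = [r for r in results if r and r > 100]
print(f"{len(spikes)} pings over 100 ms: {spikes}")
```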
The Vigor router's implementation of the cryptographic code required for SSH can stall packet processing/forwarding for approximately 300ms while an SSH session to the device is being initiated.
Conclusion
While it took some on-site investigation to find the root cause of the issue, we’re pleased we were able to help the customer - and glad to see that our service wasn’t at fault!
No vendor OS is perfect, and while it would be easy to look negatively at the Vigor router, they are excellent CPE and work very well. So well that many customers are under the impression that they can "install and forget" them.
This story should also remind everyone managing CPE on behalf of end users to keep an eye on CVEs (disclosed vulnerabilities). Older versions of the Vigor OS are affected by a remote code execution vulnerability, which once again demonstrates why management ports should never be left open to the world.
Exa takes the security of its infrastructure seriously. We apply tight access control lists to the management ports and perform regular security scans of our infrastructure for known vulnerabilities.
Exa Networks does not fall within the scope of the new Telecommunications Security Act, but as an ISO 27001 (and Cyber Essentials) business, we already follow many of the requirements set out by the legislation. While the law places new requirements on ISPs to secure their infrastructure, it does not apply to customer devices. We can only invite our customers to update and secure their devices, as without the right to manage the equipment directly, we cannot ensure that it happens.
Customers using our router management product, Router for Life, have their routers centrally configured and safeguarded via an ACS (Automatic Configuration Server), automatically keeping their equipment up to date and well secured.