By Dan Saul
Before we get started, it is important to note the following statement: any opinions expressed are solely my own and do not reflect the views of any current or former employers, individuals, or organizations.
Summary: I developed a scalable telecom SIP trunking system that handles over 200,000 calls a month (and growing) and has considerably improved the service we provide to our clients. This article documents the process we went through.
On September 16th, 2021, VoIP.ms, a SIP trunking carrier that provided services for a large number of people, including myself and the company I worked for, was targeted by the REvil group in a distributed denial of service (DDoS) attack.
A distributed denial-of-service (DDoS) attack involves inundating a computer system with a massive volume of traffic or requests, rendering it incapable of catering to genuine clients. This can be likened to a traffic jam that congests a highway, making it impossible for you to reach your destination in a timely manner.
When network troubles occur, VoIP systems may experience a variety of issues that can impact communication quality and reliability. Dropped calls may occur as a result of a lack of stable connectivity, while poor call quality can stem from network congestion leading to delays, jitter, and packet loss. Latency can also be a problem, with high delays making conversations difficult to follow. In the event of a complete network outage, users will be unable to use the VoIP system at all.
Initially, REvil's strategy involved targeting domain name servers (DNS) to disrupt client services. This was achieved by preventing client phone systems from translating internet names (such as toronto2.voip.ms) into their respective IP addresses (208.100.60.51).
Over the course of several weeks, we engaged in a continuous tug-of-war with REvil, implementing new strategies to ensure our clients' phone services remained operational despite the changing methods of DDoS attacks.
Initially, we were able to mitigate the issue by having clients connect to the servers' IP addresses directly instead of using DNS names. However, REvil persisted and escalated the attack by flooding the VoIP.ms servers themselves with traffic, bypassing the supporting infrastructure altogether.
In the past, our approach was to sell services from other VoIP providers. This strategy was practical for us, as reselling services is a straightforward process. We would establish the services and then provide assistance whenever there were any problems. We had been following this approach for several years before the attack, and although no system is completely error-free, it had been effective for us.
We developed a skill in transferring our client base to alternative servers as the attacks shifted from one server to another. Our technique involved creating custom DNS names that facilitated swift server changes. Although it did not eliminate the interruptions caused by the attacks, it enabled us to rapidly relocate our clients without having to manually adjust each PBX utilizing these trunks.
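The idea behind those custom DNS names can be sketched as a layer of aliases: each PBX is configured with a name we control, and only the alias target changes when a server comes under attack. This is a minimal in-memory illustration, not our actual tooling; `AliasTable`, the `example.net` and `example.org` names, and the `failover` helper are all hypothetical, and a real setup would update records at a DNS host with a short TTL.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AliasTable:
    """In-memory stand-in for a zone of CNAME-style aliases (hypothetical)."""

    # alias name that the PBX is configured with -> upstream server it points at
    records: Dict[str, str] = field(default_factory=dict)

    def point(self, alias: str, target: str) -> None:
        """Create or update a single alias."""
        self.records[alias] = target

    def failover(self, attacked: str, healthy: str) -> List[str]:
        """Repoint every alias aimed at `attacked` to `healthy`.

        Returns the aliases that were moved, in insertion order.
        """
        moved = [alias for alias, target in self.records.items() if target == attacked]
        for alias in moved:
            self.records[alias] = healthy
        return moved


zone = AliasTable()
# The alias and replacement server names below are made up for illustration.
zone.point("trunk-a.example.net", "toronto2.voip.ms")
zone.point("trunk-b.example.net", "toronto2.voip.ms")
zone.point("trunk-c.example.net", "upstream-1.example.org")

# One change moves every affected client; no PBX needs to be reconfigured.
moved = zone.failover("toronto2.voip.ms", "upstream-2.example.org")
print(moved)
```

The PBX configurations never mention the upstream server directly, which is what makes the relocation a single change rather than one change per PBX.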
Nevertheless, this was an arduous cat-and-mouse game that caused distress to us and our clients. The DDoS attack persisted for several weeks, resulting in client attrition that was understandable. We recognized that this situation was unsustainable.
VoIP.ms remained unresponsive during the incident and offered empty reassurances without taking any concrete actions. In my personal experience, it took three weeks and a complaint to the CRTC for them to address my request to port out my numbers. While VoIP.ms eventually placed their servers behind Cloudflare to mitigate the DDoS attack, intermittent outages are still present at the time of writing. However, to this day, they have not conducted a proper postmortem analysis of the incident, which is a fundamental responsibility of any Infrastructure as a Service (IaaS) provider in the event of significant downtime.
To me, it was apparent that the situation was not sustainable in the long term. To manage our clients' experience effectively, we had to take charge and host them directly. Our ability to be flexible and adapt to unforeseen circumstances was crucial.
To achieve this goal, we needed to develop a back-end telecom trunking system that satisfied the following requirements, listed in no particular order:
For more than fifteen years, I have been involved in software development, and for over ten of those years, I have been working as a consultant. In situations where a solution is not readily available, you must create it.
With that objective in mind and considering the previously mentioned requirements, I developed the following service architecture:
To ensure optimal service, our platform necessitates the use of multiple upstream providers, preferably Tier 1 providers who own their infrastructure, rather than reselling services. Our previous VoIP offerings are now classified as Tier 2 providers.
In order to seamlessly port numbers, we require connections with all existing upstream providers.
To prevent unauthorized access, providers have dedicated points of presence that are inaccessible to outsiders.
Our infrastructure is configured with leased virtual machines on at least two providers, offering both organizational and geographic redundancy.
Each client is assigned a unique point of presence, so that if a malicious client engages in a DDoS attack, it would only affect their own service. If this IP is discovered, we can easily terminate the virtual machine and create a new one with a different IP.
All nodes employ IP whitelisting to restrict access to authorized personnel only.
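The allowlist check itself is simple to express. The sketch below shows only the matching logic, using Python's standard `ipaddress` module; the networks listed are documentation-range placeholders, `is_allowed` is a hypothetical helper, and in practice this kind of rule would normally be enforced by a firewall in front of the node rather than in application code.

```python
import ipaddress

# Placeholder allowlist (RFC 5737 documentation ranges), not real client addresses.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.10/32"),   # a single client PBX
    ipaddress.ip_network("198.51.100.0/24"),   # a client office range
]


def is_allowed(source_ip: str) -> bool:
    """Return True only if source_ip parses and falls inside an allowed network."""
    try:
        address = ipaddress.ip_address(source_ip)
    except ValueError:
        # Anything that is not a valid IP address is rejected outright.
        return False
    return any(address in network for network in ALLOWED_NETWORKS)


print(is_allowed("198.51.100.77"))
print(is_allowed("192.0.2.1"))
```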
I invested considerable effort into designing the carrier-grade trunking system, focusing on utilizing open-source software with appropriate licenses to reduce the need for extensive development and maintenance. My aim was to leverage as much existing technology as possible in order to streamline the setup and operation of the system.
I chose Asterisk as my SIP software, whereas many carriers prefer OpenSIPS or Kamailio (which are forks of the same original software) with RTPproxy for their SIP services. However, these carriers often use Asterisk in the back-end anyway for call routing.
When these carriers started offering their own services, this was a good choice. Asterisk had a homegrown implementation of SIP called “chan_sip” that had a lot of issues. However, Asterisk 12 introduced a new channel driver, chan_pjsip, built on PJSIP, an industry standard SIP library, and since version 13.8 Asterisk can bundle that library directly. This alleviates most of the issues that afflicted chan_sip.
You might ask why I would choose Asterisk when other providers have demonstrated that other open-source software stacks work. To answer this, we have to take a closer look at Kamailio. An important consideration in choosing the software was how my organization could support it.
At my previous organization, we had numerous system administrators, as we were a managed service provider. However, I was the sole software developer. Working with Kamailio requires a deep understanding of the SIP protocol and its customization and routing, working directly with packets. This is a significant departure from a typical PBX system, where you mainly consider which extension is dialing another. The PBX model is a more straightforward concept for non-software engineers to grasp. It was my responsibility to ensure that, in the event of my absence, others could take over my role.
We would require in-house development for all the provisioning systems, along with a user-friendly front-end for the system administrators. Fortunately, I have prior experience in creating administrative interfaces, making it simple to transfer that knowledge and expertise to this particular project.
The installation of Asterisk is directly performed on the nodes, while all ancillary software is deployed using Docker. This approach enables swift and effortless deployment and updates to all nodes, while also facilitating the undoing of changes, thus affording protection and prompt recovery in the event of a flawed software version being deployed.
The back-end was built with C# and .NET, which have been a pleasure to work with ever since Microsoft made the framework open source. Compared to other languages and APIs I've worked with, it feels more like a complete package with all the necessary tools included. This has allowed me to concentrate on accomplishing tasks rather than dealing with redundant code.
I used TypeScript to program the administrative front-end because, just like with C#, having type-checking available helps to catch many trivial bugs that would otherwise only be detected at runtime.
I opted for Vue as my JavaScript framework and Vuetify for interface elements, both of which have been a delight to work with. They enable me to focus on achieving my goals at a higher level rather than having to implement every single detail myself.
After manually setting up the system, I was able to prepare the minimum viable product (MVP) within a few weeks. The initial results looked very encouraging, although we did encounter some minor problems with certain obscure PBXes that needed to be addressed. Fortunately, resolving these issues was not too difficult. As a result, we were able to accelerate our efforts and work tirelessly to complete the entire system.
It took an additional month of long days and testing to extend the MVP's scope beyond our in-house system, but I completed it and prepared it for production.
Ensuring successful emergency call completion for 911 calls was a top priority, given the challenges with E911 service in Canada. Unlike in the US, where a Centralized Emergency Service ID (CESID) is registered in a central database, E911 in Canada relies on the caller ID number sent by the PBX. Furthermore, Northern 911, a government monopoly, manages E911 in Canada and charges a fee for each registered number.
To avoid being disconnected during an emergency call, it's crucial to send the exact information required by the upstream provider. However, different providers have different requirements for the format of the caller ID number.
For instance, certain providers may demand 10-digit numbers (e.g., 2045551234) and not 11-digit numbers (e.g., 12045551234) or E.164 format numbers (e.g., +12045551234). If you fail to send precisely what the upstream provider requires, they may disconnect the call, which is particularly dangerous during an emergency. Additionally, it is imperative that the number be sourced from the trunk provider it is coming in on. It's therefore necessary to ensure that 911 calls with a specific caller ID are routed through the correct trunk.
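As a rough illustration of the formatting problem, the sketch below converts a North American number into each of the three formats mentioned above. It is a simplified example and not the production code, which is written in C#; `CallerIdFormat`, `normalize_nanp`, and `format_caller_id` are hypothetical names, and the sketch assumes every number belongs to the North American Numbering Plan (country code 1).

```python
import re
from enum import Enum


class CallerIdFormat(Enum):
    """Caller ID formats an upstream provider might require (hypothetical enum)."""

    TEN_DIGIT = "10-digit"      # 2045551234
    ELEVEN_DIGIT = "11-digit"   # 12045551234
    E164 = "e164"               # +12045551234


def normalize_nanp(number: str) -> str:
    """Reduce a North American number in any common notation to its 10 digits."""
    digits = re.sub(r"\D", "", number)  # drop '+', spaces, dashes, parentheses
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        raise ValueError(f"not a valid NANP number: {number!r}")
    return digits


def format_caller_id(number: str, fmt: CallerIdFormat) -> str:
    """Render the caller ID in the format the chosen provider expects."""
    national = normalize_nanp(number)
    if fmt is CallerIdFormat.TEN_DIGIT:
        return national
    if fmt is CallerIdFormat.ELEVEN_DIGIT:
        return "1" + national
    return "+1" + national


print(format_caller_id("+1 (204) 555-1234", CallerIdFormat.TEN_DIGIT))
print(format_caller_id("2045551234", CallerIdFormat.E164))
```

Normalizing to a single internal form first and formatting per provider at the last step keeps provider-specific quirks out of the rest of the routing logic.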
To reduce costs and ensure accuracy, we typically register only one number per physical location with Northern 911, using it as the default 911 number for all calls going out. Our system cascades through different levels of registration to determine which 911 number to use, starting with the phone number registration, then trunk registration, and finally the global default for the client account. We aim to prevent emergency call disconnections, minimize costs, and provide accurate information to emergency services.
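The cascade described above amounts to a lookup with two fallbacks. The following is a minimal sketch under assumed data structures, not the real schema: `ClientAccount`, its field names, the trunk identifier, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ClientAccount:
    """Hypothetical view of a client's 911 registrations."""

    # Global default registered for the client account.
    default_911_number: str
    # Trunk identifier -> 911 number registered for that trunk.
    trunk_911_numbers: Dict[str, str] = field(default_factory=dict)
    # Individual phone number -> 911 number registered for it.
    number_911_numbers: Dict[str, str] = field(default_factory=dict)


def resolve_911_number(account: ClientAccount, caller_number: str, trunk_id: str) -> str:
    """Pick the 911 caller ID: phone number level, then trunk level, then account default."""
    return (
        account.number_911_numbers.get(caller_number)
        or account.trunk_911_numbers.get(trunk_id)
        or account.default_911_number
    )


account = ClientAccount(
    default_911_number="2045550100",
    trunk_911_numbers={"branch-office": "2045550200"},
    number_911_numbers={"2045551234": "2045551234"},
)

print(resolve_911_number(account, "2045551234", "branch-office"))  # number-level match
print(resolve_911_number(account, "2045559999", "branch-office"))  # falls back to trunk
print(resolve_911_number(account, "2045559999", "head-office"))    # falls back to default
```

Because the account default always exists, the lookup can never come back empty, which matters when the call in question is an emergency call.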
After preparing our service for production, our next step was to transition clients onto it. We adopted a gradual approach by migrating a few hosted clients each day initially, as we were already familiar with their configuration and could easily monitor and test it. This continued until all clients had been transferred. We encountered no problems throughout this process, and the implementation was executed seamlessly.
We initiated the transfer of numbers from Tier 2 providers to Tier 1 providers. Although it was a time-consuming process, we proceeded cautiously to minimize any disruptions. The only problems that occurred after deployment were unplanned outages from VoIP.ms.
As of the time of writing, the migration has advanced to the point where our hosted system accommodates 90% of our clients. The system has been operating exceptionally well.
I take great pride in my creation. On several occasions, clients faced issues, but when we switched them to the new trunking system as the initial troubleshooting step, the problems were resolved. It is a gratifying experience to have designed a system capable of handling over 200,000 calls every month without any hitches. I am eagerly anticipating the milestone of 400,000 calls per month.
If you have any questions, want to discuss this article, or want to get in touch with me for any other reason, please drop me a line.