Global connectivity is challenging, and many factors add complexity for devices. Networks are not homogeneous, and testing on one network does not guarantee the same behavior on the next. Network providers aim to follow the same standards, but experience shows that individual implementations differ, which can cause unexpected behavior.
Sometimes a permanent loss of connection can be triggered by a simple maintenance window or a malfunction on a specific cell within range of the device. This article introduces a robust fallback strategy to avoid high failure rates.
Why is this an issue?
Historically, devices such as cell phones have been tied to a single operator with a small number of roaming partners. Network roaming standards were developed under this assumption, and thus the current standards are optimized for a different generation of networks.
With the introduction of true global roaming SIMs, the assumption that one home network and several roaming partners is the optimal use case no longer holds. The 'old way' is not geared for dynamic safelists and ever-changing roaming partners and agreements.
Previously, a device would have only a few networks it could attach to and would retry those same networks repeatedly until it established a successful connection. While this ensures a reliable connection, it is not suited to low-power devices or to the flexibility introduced with next-generation roaming SIMs. Adding all local networks available to Onomondo users to the list of allowed networks is neither viable nor possible, and would result in other complications.
An example for context
Previously:
You would have a primary network (the home PLMN). When establishing a connection, the device would look for that network. If it was not available, the device would refer to a list of alternatives (roaming partners) and try those in order.
It would repeat this until a connection was established. The lists are static, and your device might have attached to a network with a weak signal because the device prioritized the network on the static list. This method was suitable for ensuring connectivity but bad news for your battery life and operating costs.
Now:
The device attaches to the network with a strong signal. There are no predefined lists, and all networks have equal priority. Occasionally it might hit a network that is not on the dynamic safelist; let's call this operator ExpensiveNetworkDK. All good.
The following month ExpensiveNetworkDK comes with a reasonable offer, and you now want only to use ExpensiveNetworkDK on your devices. Easy, right? Just update the safelist to that one network, and you're good to go.
In reality, your device will not connect to ExpensiveNetworkDK, because the device marked this network as forbidden and added it to the forbidden network list (FPLMN) when it was rejected earlier. Worse, the remaining networks now reject the device too, so it adds them to the same list and ends up refusing to connect to anything. It simply has no idea of what network to attach to and is effectively lost.
Modems, however, still work under the assumption that you have a fixed, prioritized list of networks available as a fallback.
The solution:
We'll introduce a robust fallback procedure to avoid the scenario outlined above. Following fallback procedures future-proofs your device and ensures that your device will be able to recover and adapt to new network safe-lists.
What can be done?
Here is our recommended procedure for troubleshooting a device's connectivity issues. We have divided this process into four steps:
1: Implementing a Recovery Trigger
Your device should be aware of both its current and historical connectivity status. This information will be necessary for implementing an accurate trigger and a proper back-off strategy.
After the initial attach, the device should be online within 10 seconds of booting up. In situations where the previous network is unavailable, as is usually the case with trackers crossing borders, connecting can take up to several minutes. Before you deploy, test thoroughly.
Note the median connection time (first attach only) and add 50%. This amount will serve as your worst-case connection time and as a baseline for the recovery trigger.
Estimating worst case time to attach
Note the approximate time the initial attach takes on different networks in different locations. Find the median and use it as T_INITIAL_ATTACH.
The median time will vary across devices, access technologies, and locations. LTE-M enabled devices are significantly faster than their 2G counterparts, for example. The variance on devices with 2G fallback will likely be higher. T_INITIAL_ATTACH will probably be in the 1-5 minute range.
Use T_TRIGGER = T_INITIAL_ATTACH x 1.5 as a baseline timeout for unsuccessful attach (and loss of connectivity).
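The calculation above can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: the function name trigger_timeout and the sample measurements are hypothetical, and the 1.5 factor is the 50% margin suggested above.

```python
from statistics import median


def trigger_timeout(attach_times_s, factor=1.5):
    """Return T_TRIGGER in seconds from measured first-attach times."""
    if not attach_times_s:
        raise ValueError("need at least one measured attach time")
    t_initial_attach = median(attach_times_s)  # T_INITIAL_ATTACH
    return t_initial_attach * factor           # T_TRIGGER


# Hypothetical first-attach measurements (seconds) from field tests.
samples = [62, 75, 90, 140, 210]
print(trigger_timeout(samples))  # median is 90 s, so this prints 135.0
```

Using the median rather than the mean keeps a single very slow attach from inflating the timeout.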
Depending on your preference, you now have two approaches:
Optimised for battery life
If battery life is a high priority, you should back off at this point and try again later. Backing off here avoids unnecessary network scans in cases where the device is outside any coverage. If the retry is unsuccessful, refer to step 3.
Optimised for robust connectivity
If high availability is the priority, continue to step 2. By not invoking the back-off timer just yet, you can potentially recover a bit faster at the expense of higher energy consumption.
Before implementing this method, make sure you understand and implement a back-off strategy. A back-off strategy is essential for reducing overall power consumption when the device should not connect to a network. If this process repeats too frequently, the device is at risk of temporary bans from a local Radio Access Network (RAN).
1.1: Failure to create a data session
Additionally, you should track whether data is actually offloaded successfully, as we have observed that some operators allow the device to attach but refuse to open a data session. This is bad practice, but luckily less critical, as you have the option to block the operator from the Onomondo platform and recover gracefully. If your device repeatedly fails to activate the PDP context on the same network, it may be necessary to switch networks. See step 2.
Triggering on failed data offloads may be a more robust approach when the device attaches, fully or partially, but fails to offload the payload. However, it may increase the connection time for incidents unrelated to network connectivity.
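One way to track repeated data-session failures per network is a small counter keyed by PLMN. The sketch below is illustrative only: the class name DataSessionTracker, the threshold of three failures, and the choice to count consecutive failures (a success resets the count) are all assumptions to tune for your device.

```python
from collections import Counter


class DataSessionTracker:
    """Counts consecutive data-session failures per network (PLMN)."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures  # assumed threshold, tune per device
        self._failures = Counter()

    def record(self, plmn, session_ok):
        """Record the outcome of one data-session attempt on a network."""
        if session_ok:
            self._failures.pop(plmn, None)  # a success resets the count
        else:
            self._failures[plmn] += 1

    def should_switch(self, plmn):
        """True once the network has failed max_failures times in a row."""
        return self._failures[plmn] >= self.max_failures


tracker = DataSessionTracker(max_failures=2)
tracker.record("00101", session_ok=False)
tracker.record("00101", session_ok=False)
print(tracker.should_switch("00101"))  # True: time to move to step 2
```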
2: Running a Full Network List Scan
Once the recovery trigger has fired, the next step is to diagnose the issue.
Run a full scan on the device using AT+COPS=?. This command will return a list of available networks paired with the access technology and network status:
AT+COPS=?
+COPS: (STATUS,"NETWORK_1","NAME","MCCMNC",TECH),(STATUS,"NETWORK_2"...
OK
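The response format above can be parsed with a regular expression. The following is a minimal sketch, assuming each entry has the form (status,"long name","short name","MCCMNC"[,tech]); the function name parse_cops_scan and the sample response are hypothetical, and some modems format the list slightly differently, so check the output of your own module.

```python
import re

# One entry: (status,"long name","short name","numeric MCCMNC"[,tech])
_ENTRY = re.compile(r'\((\d),"([^"]*)","([^"]*)","(\d+)"(?:,(\d+))?\)')


def parse_cops_scan(response):
    """Return a list of dicts, one per network in an AT+COPS=? response."""
    networks = []
    for stat, long_name, short_name, plmn, act in _ENTRY.findall(response):
        networks.append({
            "status": int(stat),
            "name": long_name,
            "short_name": short_name,
            "plmn": plmn,
            "act": int(act) if act else None,  # tech may be omitted
        })
    return networks


# Hypothetical scan result with two networks.
sample = '+COPS: (2,"Network A","NetA","00101",7),(3,"Network B","NetB","00102",0)'
for net in parse_cops_scan(sample):
    print(net["plmn"], net["status"], net["act"])
```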
If the network status for every network is marked "3" (forbidden), no allowed networks are available.
Diagnosis: The safelist has been updated and/or there are issues with the network.
Fix: Clear the FPLMN list. Restart your attach procedure. Your device can now freely try all networks again.
If some networks are marked "1" (available), it is normally not an FPLMN issue.
Diagnosis: There are available networks, and automatic selection has failed. If your device has been unable to start the data session on a network, that network should be treated as unavailable.
Fix: Try attaching to each available network in turn until one succeeds. If this fails, move on to step 3.
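The two cases above can be expressed as a small decision function over the status values from the scan. This is a sketch under stated assumptions: the function name next_action and the returned action labels are hypothetical, and treating an empty scan or any other mix of statuses as "back off" is an assumption, since only the two cases above are covered here.

```python
STAT_AVAILABLE = 1
STAT_FORBIDDEN = 3


def next_action(statuses):
    """Pick a recovery action from the status values of a full network scan."""
    if not statuses:
        return "back_off"        # assumption: nothing in range, go to step 3
    if all(s == STAT_FORBIDDEN for s in statuses):
        return "clear_fplmn"     # every network forbidden: clear FPLMN, re-attach
    if any(s == STAT_AVAILABLE for s in statuses):
        return "try_available"   # attach to the available networks one by one
    return "back_off"            # assumption: any other mix is left to step 3


print(next_action([3, 3, 3]))  # clear_fplmn
print(next_action([3, 1, 3]))  # try_available
```

How the FPLMN list is cleared depends on the modem. A commonly documented approach is to overwrite the SIM file EF_FPLMN with AT+CRSM=214,28539,0,0,12,"FFFFFFFFFFFFFFFFFFFFFFFF"; treat this as an assumption and confirm the file length and command support in your modem's AT manual.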
Why not just clear the FPLMN all the time?
In short, it's there for a reason and is still very useful. Clearing the FPLMN too often will lead to a significant decrease in performance. You are messing with the automatic network selection process (that generally does a good job), and your device will waste power and time trying to connect to bad networks.
3: Back-Off Strategy Recommendations
It's essential to have a back-off strategy if the first one or two loop triggers are unsuccessful. In some instances, the device may be in an area with no available networks, or you may want to deactivate the device (customer not paying, unusual usage activity, debugging, etc.).
In these scenarios, a device that keeps looping unsuccessfully will use additional battery, and repeated attempts on local RANs can cause temporary blocking of the SIM or IMEI.
Find a back-off strategy that works for you. As with many things, it is a compromise between power and robustness. As a rule of thumb, the time between each recovery should increase exponentially.
Examples of the time between each recovery attempt are outlined below. Use this as a reference to fit your needs.
Low power:
30 minutes
2 hours
6 hours
12 hours
24 hours
24 hours ...
High uptime, high power:
5 minutes
10 minutes
20 minutes
40 minutes
1 hour
1 hour ...
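The two example schedules can be written as simple generators. The sketch below is one possible way to do it: the names LOW_POWER_MIN, HIGH_UPTIME_MIN, backoff_delays, and doubling_backoff are hypothetical, and all delays are in minutes.

```python
from itertools import chain, islice, repeat

# The example schedules above, in minutes.
LOW_POWER_MIN = [30, 120, 360, 720, 1440]
HIGH_UPTIME_MIN = [5, 10, 20, 40, 60]


def backoff_delays(schedule_min):
    """Yield the delays in order, then repeat the last one forever."""
    return chain(schedule_min, repeat(schedule_min[-1]))


def doubling_backoff(first_min, cap_min):
    """Yield delays that double each time, capped at cap_min."""
    delay = first_min
    while True:
        yield delay
        delay = min(delay * 2, cap_min)


print(list(islice(backoff_delays(LOW_POWER_MIN), 7)))
# [30, 120, 360, 720, 1440, 1440, 1440]
print(list(islice(doubling_backoff(5, 60), 7)))
# [5, 10, 20, 40, 60, 60, 60]
```

The high-uptime schedule is a plain doubling sequence capped at one hour, so doubling_backoff(5, 60) reproduces it; the low-power schedule is not a strict doubling and is easier to keep as an explicit table.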
4: The Last Resort
If the recovery has been triggered too many times with no success, your modem may force itself into a bad state. This situation is rare, but implement this as a last fail-safe.
Turning the device off and on gracefully does not clear all volatile memory, and the modem can continue in an undefined state. Additionally, the modem can mark the SIM as unsuitable for service, in which case normal operation only resumes after a new SIM is inserted or the modem is power cycled. At this point, it is time to pull the plug.
You have three options here:
Do a non-graceful shutdown of your modem if possible. Depending on the board layout, this might not be possible. Some manufacturers do not recommend this, as the modem can become bricked, so check your modem's datasheet first. Repeat your attach procedure. If it continues to fail, move to the next step.
Do a soft reset. Modems often have an external reset option, either dedicated or integrated with the power control pin. Check your modem datasheet for details. Alternatively, the reset can be done through AT commands. Repeat your attach procedure.
If all else fails, do a modem factory reset.
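The three options above form an escalation ladder: try one, repeat the attach procedure, and only move on if it still fails. A minimal sketch of that control flow follows; the function name last_resort is hypothetical, and the reset actions and attach check are stand-in callables that you would replace with your own modem-specific code.

```python
def last_resort(actions, try_attach):
    """Run recovery actions in order; stop at the first one that restores service.

    actions: list of (name, callable) pairs, in the order they should be tried.
    try_attach: callable returning True when the device is attached.
    Returns the name of the action that worked, or None if all failed.
    """
    for name, action in actions:
        action()
        if try_attach():
            return name
    return None


# Stand-in callables: here the attach succeeds only after the second action.
attach_results = iter([False, True])
worked = last_resort(
    [
        ("non-graceful shutdown", lambda: None),
        ("soft reset", lambda: None),
        ("factory reset", lambda: None),
    ],
    try_attach=lambda: next(attach_results),
)
print(worked)  # soft reset
```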
Conclusion
If you continue to have issues reconnecting your device after following the recommended procedures, please reach out to Onomondo Customer Success. We understand that connectivity troubleshooting can be meticulous with the range of devices and use cases today. This article will continue to evolve as new procedures for initializing recovery modes develop.