[sipX-dev] Multi-branch strategy and the backup registrar
Folks,
For a while now I have been thinking about the proposed multi-branch
strategy. I have been trying to convince myself that the backup
registrar is the way to go, but I cannot shake the feeling that it is a
complex solution that yields limited benefits. The attributes of the
backup registrar solution that bother me the most are:
* Creation of a new software component and a new role means increased
complexity of the solution (code, deployment & management)
* The backup registrar is consulted for every call to users in the domain
* Increases the messaging load on the system. For example, on a call from
a user in branch A to a user in branch B, sipX at branch A will have to
send out a minimum of 2 extra INVITEs and ACKs and deal with the
corresponding responses. On every such call, the called user will also
receive 2 INVITEs, one being a loop of the other
* Carries risk as we start using new tricks: handling of incoming 302s
from outside; the DB replication arrangement at the backup registrar;
subjecting phones to looped messages on every incoming call from users
in other branches
But all this is a moot point if the backup registrar can be proven to
significantly increase the availability of the solution and make the
solution much more resilient.
The increased availability associated with the introduction of the
Backup registrar is what I have been focusing on in the last few days.
My main objective was to compare the availability of a
'backup-registrar-based solution' against a 'slim' approach. The
'slim' approach uses only a few attributes of the currently proposed
solution: central management, the concept of a home base for a user,
and the ability to push home-base aliases for users at other branches
to all the sipXecs systems in the multi-branch deployment.
To carry out the comparison exercise, I used what I believe to be a
plausible multibranch network deployment and defined a series of failure
scenarios within that deployment. For each scenario, I noted which
calls were possible and which ones became impossible to make for both
the 'backup-registrar solution' and the 'slim' approaches. Because I'm
not an ASCII artist, I decided to do this portion inside a Google Docs
spreadsheet. To understand the rest of this discussion, you will first
need to familiarize yourself with
http://spreadsheets.google.com/ccc?key=0AqS9Vc1k8ubKcjZTd3VZdzVadnZDaVpxQ2Zna0ZaeXc&hl=en
Assuming that the network topology used is realistic and that the
results presented in the spreadsheet are accurate (comments welcome),
the spreadsheet shows that, out of the 54 scenarios evaluated, only
three are possible under the 'backup-registrar' solution while
impossible under the 'slim' one; for all other scenarios considered,
the solutions are equivalent. Before diving into the details, I would
like to point out that the list of scenarios evaluated is obviously not
exhaustive, but I tried to come up with all the scenarios that I thought
could highlight differences between the two solutions (that's the whole
point of the exercise). If I overlooked some, please flag them to me
and I'll be happy to evaluate them.
Scenario #1:
The first scenario evaluated is a case where Branch A and Branch B are
both unreachable because of network failures. With such a failure,
things are pretty bad: 14 of the 18 use cases fail for both solutions,
3 of the 18 work for both solutions, and one use case works only with
the 'backup-registrar' solution. More specifically, the
'backup-registrar' solution lets a roaming Branch A user receive a call
from a user at another branch while the networks to Branch A and B are
both down, but this comes with limitations. The user will be able to
receive calls only until its registration expires, which can be
anywhere from 1 second to 1 hour later; after that, calls will fail.
Also, the user cannot be called by its alias, and the call will not do
find-me/follow-me nor go to voicemail if unanswered.
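To make the "1 second to 1 hour" window concrete, here is a minimal sketch of the arithmetic. It assumes a registration lifetime of 3600 seconds (matching the 1-hour upper bound above) and that the phone cannot refresh its registration once the outage starts. The function names are hypothetical and are not part of sipX.

```python
# Sketch only: how long a roaming user stays reachable through the
# backup registrar once its home branch becomes unreachable.
# Assumption: the registration cannot be refreshed during the outage.

def seconds_still_reachable(expires_s: int, age_at_outage_s: int) -> int:
    """Registration lifetime left at the moment the outage begins."""
    return max(0, expires_s - age_at_outage_s)


def call_can_be_routed(expires_s: int, age_at_outage_s: int,
                       seconds_into_outage: int) -> bool:
    """True while the last registration is still unexpired."""
    remaining = seconds_still_reachable(expires_s, age_at_outage_s)
    return seconds_into_outage < remaining


# Worst case: the phone registered 3599 s before the outage.
worst = seconds_still_reachable(3600, 3599)
# Best case: the phone registered just as the outage began.
best = seconds_still_reachable(3600, 0)
print(worst, best)
```

With these assumptions, how long the salvaged use case keeps working depends entirely on where in its refresh cycle the phone happened to be when the failure occurred.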
Scenario #2:
Branch A is unreachable because of network failures but Branch B is
unaffected. In such a scenario, both solutions perform identically:
about 50% of the use cases work.
Scenario #3:
Impairment: SipX A is down but network to Branch A is operational - SipX
B and Branch B network are both operational. In such a scenario, both
solutions perform identically: almost all use cases work.
Scenario #4:
Impairment: SipX A and SipX B are both down but networks to both Branch
A and B are operational. In this scenario, things are pretty bad: 16 of
the 18 use cases fail. The other two work for the 'backup-registrar'
solution while they fail for the 'slim' solution. However, the
limitations already stated apply to these cases as well, i.e. the user
will be able to receive calls only until its registration expires, which
can be anywhere from 1 second to 1 hour later; after that, calls will
fail. Also, the user cannot be called by its alias, and the call will
not do find-me/follow-me nor go to voicemail if unanswered.
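As a cross-check on the figures quoted above, the following sketch tallies the exact per-scenario counts given in this post. Scenarios #2 and #3 are left out because only approximate figures are given for them and no difference between the solutions is reported there. The `Tally` type and its field names are hypothetical, introduced only for illustration; the total of 54 is taken from the post as stated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tally:
    fail_both: int     # use cases that fail under both solutions
    work_both: int     # use cases that work under both solutions
    backup_only: int   # use cases that work only with the backup registrar

    @property
    def total(self) -> int:
        return self.fail_both + self.work_both + self.backup_only


# Counts as stated in the post for the scenarios with exact figures.
tallies = {
    "Scenario #1": Tally(fail_both=14, work_both=3, backup_only=1),
    "Scenario #4": Tally(fail_both=16, work_both=0, backup_only=2),
}

TOTAL_EVALUATED = 54  # figure quoted in the post

backup_only_total = sum(t.backup_only for t in tallies.values())
share = backup_only_total / TOTAL_EVALUATED

for name, t in tallies.items():
    print(f"{name}: {t.backup_only} of {t.total} only work with the backup registrar")
print(f"Overall: {backup_only_total} of {TOTAL_EVALUATED} ({share:.1%})")
```

The per-scenario counts each add up to 18 use cases, and the backup-registrar-only cases add up to the three cited in the summary.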
What does this all mean? Well, everybody is free to draw their own
conclusions, but I will share mine. Perhaps I completely missed the
point and my understanding of the proposed 'backup-registrar' solution
is flawed, but I find that the performance of the two solutions is very
similar. The 'backup-registrar' outperforms the 'slim' solution in
only 3 of the 54 use cases. Each one of these three use cases happens
in failure scenarios where almost every other use case fails, so one
could dispute the value of being able to salvage a call from a roaming
user to a user in another branch when every other call type fails...
Also, every one of the three use cases salvaged by the
'backup-registrar' has severe limitations, the biggest being that it
will only continue to work until the registration expires, which could
be a very short period. All in all, I do not believe that the effort,
risk and complexity associated with the 'backup-registrar' solution are
justified by the modest gains it yields. My opinion is that this
analysis, if accurate, reveals that neither solution performs very well
in the face of severe impairments and that, until we come up with a
solution that performs well under these conditions, we should take the
simplest approach...
Please feel free to correct me as needed. There is a lot of data
presented here and I want it to be as accurate as possible.
Thank you,
bob