Re: [Ietf-behave] review of rfc3489bis-04
Bruce,
Thanks for the thoughtful and outstanding comments. Responses inline:
Bruce Lowekamp wrote:
Review of draft-ietf-behave-rfc3489bis-04
Bruce B. Lowekamp
College of William and Mary and
SIPeerior Technologies, Inc.
I waited to write this review until we completed implementing this
STUN draft inside the reSIProcate stack. A number of these comments
stem from the design of that implementation, which does initial
classification and parsing at the stack layer and then passes the
message up for further processing at the dialog/application level when
necessary.
I've grouped comments into two sections: protocol issues, where I
think the current specification should be changed in some way, and
document clarifications, where I think the draft has a mistake that
doesn't reflect the intention of the protocol. I also made some notes
on a few more general clarity and grammar issues that I am forwarding
to Jonathan separately.
Having worked with a previous implementation guided only by the
original 3489 and also worked with this version, 3489bis is
significantly cleaner and easier to read, understand, and implement.
Tiffany Broadbent did a lot of the implementation and many of these
comments stem from questions she brought up while developing.
Thanks for you and Tiffany for the comments. Responses below.
Protocol issues:
There are several issues that are motivated by the combination of
supporting new usages as extensions and trying to integrate our
implementation into an entire SIP stack.
- It would be much easier to implement a stack-level for STUN if
specific bits indicated whether a message is a request, response,
error response, or indication. In several spots the draft refers to
response and error response types as 0x100 and 0x110 higher, which I
assume is intended to indicate bits to be set, but I don't think it
technically reads that way. However, no way of determining whether
a type is an indication is mentioned in the draft. Providing for
future message types to be identified without recompiling the code
with a new list of types for a new usage would be very helpful.
I think this is a good idea, almost a necessity based on the layering
that is now introduced. You are correct that currently nothing indicates
that a message is an indication, and it should. Unfortunately the
currently allocated values for the TURN indications (which is the only
place they are being used) doesn't allow an easy way to identify them.
However I think we are better off changing the turn message values to
fix this.
I have now made this explicit by using two bits from the message type
header field to indicate the "class" of the message (request, success
response, error response, indication), and the remaining 12 bits are the
method. This formalizes the message type structure which has been there
till now. Unfortunately this will require a change in the message type
values for the indications in TURN, but I think its worth it.
- There are a number of issues if future usages are defined that are
not idempotent. For example, Section 6 allows bit-wise identical
requests to use the same transaction ID. I assume this is intended
to simplify periodic keepalives for low-power devices, however I
don't think that assuming re-using transaction IDs is safe for future
usages is a good idea. I suggest revising the two sentences (or one
if changed) in that paragraph that allow TID reuse to indicate that
TIDs cannot be reused except for retransmissions of the same request
unless allowed by the particular usage.
It wasn't meant to imply that you can reuse a TID for future
transactions. I'll reword.
- In a similar vein, there is currently no specification for how long
a server must remember a previous TID to ensure that a
retransmission or duplication is not processed twice. Also, there
is no longer any specification of what a client should do if it
receives a duplicate response. Seems like in both cases there
should be a suggestion to remember the previous TID for 9500ms to
avoid processing the same transaction multiple times.
This is an excellent point. I've added text around client and server
handling of retransmits. THe text on the server side normatively
requires retransmitting the response, but allows this to be done either
by storing the response or recomputing, the latter being possible only
for certain methods (namely the Binding Request). This is important to
allow for stateless processing in a basic server.
- Why is DTLS not mentioned?
Why would it be mentioned? STUN does not currently run on DTLS. If you'd
like to write a spec, go ahead :)
- I would like to see an overall transaction timer. Multiplying the
9500ms request timeout by a set of shared secret, nonce,
ALTERNATE-SERVER, and SRV cycles, the worst-case time for a single
transaction is quite bad. A misconfigured set of servers could set
up an ALTERNATE-SERVER loop.
I don't see a need for an additional timer. Certainly we have not done
this for SIP itself. If the application wants to abort at some point, it
can do so on its own. We don't need normative language around this.
- Are NONCEs intended to be used to challenge each individual
request/response or as a per-session value?
Per-session. That is, they are cached and reused in subsequent requests
to the same IP/port.
Section 8.3.2 says that
NONCE values should be cached and reused on subsequent
request/response. 8.3.1, item 2 says: "The message would not
contain the NONCE attribute," and at the end of the list the next
paragraph says that the NONCE is included only if the request is a
retry as a consequence of an error response. I think item 2 of
8.3.1 should say "The message MUST NOT contain the REALM attribute,"
regardless of the NONCE issue.
Yeah, item two is a bug. It should say REALM, as you indicate.
My impression previously has been
that the intention of the NONCE is to default to a per-session value
unless the server wants to authenticate more often, but 8.3.1
doesn't seem to support that.
- I don't see an error code for unknown message type. Perhaps 405?
(generating these properly also requires being able to identify whether
the unknown type is a request or an indication)
Per above we can now do this. I am somewhat conflicted on whether we
need to. The original STUN didn't include this sicne there just wasn't
the level of flexibility and extensibility to merit it. Even now, its
not clear. We're tying extensibility more to usages than methods. I'm
inclined to omit it, and add text that says to reject with a 400 if the
method is unknown. If I added a code for unknown methods, you'd need to
add a "supported"-like mechanism as SIP has for that code to be useful,
and furthermore only under the assumption that there is remedy (i.e.,
another method to try).
- 8.3.2 436 response: this would be the wrong error code for an
expired short-term, but should the client get another
shared-secret if the username was acquired through a
shared-secret request?
Good question. I think so, to handle cases where the problem IS in fact
expiry but the server has forgotten that it even expired the username.
I'll need to add words on not retrying infinitely in the case of
repeated 436.
- 8.3.2 500 response: several?
Is amazingly hard to be concrete and useful here. If STUN overload turns
into a real problem we'll need a proper mechanism for this. For now this
is an OK placeholder I think.
- We had a discussion at the last IETF of whether to retain the
current fingerprint, switch to a CRC, use a keyed CRC, or drop it.
I believe the consensus was to switch to a keyed CRC, but after
implementing this draft, I would favor making the fingerprint
optional. IMHO, the application multiplexing stun and application
data is in the best position to identify whether it is needed, and
in the vast majority of cases, it isn't needed. Even with our generic
implementation, it would be trivial for us to add an option to
include or not include FINGERPRINT when we're multiplexing over a
single socket, and none of our current applications would require
it.
THere seems to be some consensus around that. I've made it optional and
turned it into a CRC with xoring.
- 11.6: for our implementation, it was useful to use 408 and 484 to
deliver errors from the stack to the application layer. Since
STUN isn't designed to be proxied, it may not make sense
to include them here, but it does make the implementation easier
to use them.
I assume you are referring to SIP response codes here? Regarding
timeouts (408), I see no need to introduce them here. Your own
implementation can do that as a matter of design choice. We don't need
to say anything about it. Regarding 484 (address incomplete) - I dont
see the relevance.
- 11.9 why not just pad out the extra field?
1/2 dozen one hand, 6 on the other.
- in section 12, there is an explicit assumption that binding requests
will always require integrity, but keepalives won't. I don't think
I agree with this in all cases. If using STUN for binding inside a
TLS connection, I don't think integrity is required.
Fair enough; I dont think the spec says you have to use it though. So
I'm not sure what you are asking me to change with the binding usage.
In the
keepalive case, I also don't think I agree that re-establishing a
session is necessarily a trivial cost in time or in lost behavior.
Well, triviality is relative. Relative to the costs of authenticated
each keepalive, I think its trivial :)
First, a lot of work has been done with sip-outbound to stage this
sort of reconnection.
Right... nothing here is contradicting that.
Secondly, since STUN will be applied to other
protocols beside SIP, I think this distinction assumes
characteristics of the application protocol that are not known in
advance. I would prefer SHOULD-level strength here, and some
discussion of how to choose when to require message integrity and
when not to.
I will soften it a bit. However, for the case of sip-outbound I am quite
opposed to authentication. Practically speaking, it is unfathomably
expensive. I would rather remove the feature which has the UA
re-register if it sees a change in IP.
- 12.3.1: should a changed binding be interpreted as a clue to reduce
the refresh interval? (I assume the answer is yes, and mostly
think it should be explicitly suggested by the text)
Hmm. I think that could be dangerous. An attacker could force you into
rapid refreshes. What would the trigger be for you to increase the time
again?
- 12.3.2: assuming that keepalives are used for a fixed server also
extends from assuming SIP is the application protocol. This
assumption should probably be weakened/removed.
What do you mean by "fixed"? There is an assumption about keeping
bindings alive for a long lived associated for receipt of incoming
communications. That assumption is stated in 12.3.1.
- 12.4.1/12.4.7: one says shared-secrets are specific to SRV Priority,
the other to domain name
This is a bug, thanks. It can't hurt to be conservative and use it on
all SRV records. I'll fix 12.4.1.
Document clarifications:
section 6: as mentioned on the mailing list, the header includes the
magic cookie and the body has at least one attribute in the current
spec.
Fixed (though its now zero or more with fingerprint being optional)
section 6, 3rd par: stun can also run over TLS/TCP, right?
fixed.
end of 7.1 and 7.2: I think these are supposed to say the same thing
about trying multiple SRV results, but it's a bit confusing since
they're worded completely differently. I prefer the 7.1 version.
ok, i made them consistent using the 7.1 version
end of 8.2: should refer to a section
I'm not sure what you are referring to.
8.3.2 first par: what if there is a MESSAGE-INTEGRITY in the response
but not request
good catch - weird case. Ignore MESSAGE-INTEGRITY and continue. I've
noted that.
8.3.2 starting after the second paragraph: Every paragraph has a
sentence or two, worded slightly differently, about using a NONCE if
returned. Can this be grouped into a single paragraph before the
long list? Also, toward the end of the section, there is the
paragraph that refers to caching NONCEs. This paragraph should be
moved here along with the global NONCE rules.
Actually towards the end it repeats the simple rule that can be
distilled - a retry contains the nonce from the error response. I've
consolidated as you suggest.
8.3.2 starting after the second paragraph: After each code, put the
parenthesized reason phrases from 11.6. It's too hard to read
with just the codes.
fixed.
8.3.2 third par: remind that the long-term password is pre-provisioned
here
added
- 8.3.2 300 response: should the same s-t credentials be used?
depends on domain or on srv response?
may as well always do it. If you guess wrong you'll be re-challenged
anyway. Typically alternate-server is across a cluster so they would work.
- 8.3.2 500 paragraph: Unknown response codes should start a new
paragraph, 3XX should be included.
fixed.
- 8.3.2 paragraph after 500: I think this should say that the unknown
optional attributes should be ignored, not a response containing
unknown optional attributes.
oops - big bug there. fixed.
- 8.3.2 last two paragraphs: why are these two paragraphs here?
Seems like they should be later on in the sections describing
handling the attributes.
removed
- 8.3.1 and 8.4: UDP can be pipelined, as well.
fixed
- 9.1 the one sentence before 9.1.1 seems to apply only to 9.1.1
removed
- 9.1.1 1st par: can do stun binding over tls too, as well
fixed; though if I were being pedantic I'd point out that PORTS are only
defined for UDP and TCP. TCP/TLS is not a separate transport protocol.
- 9.1.1: there are two different paragraphs that refer to generating a
432 if the USERNAME is not present.
removed second occurence
- 9.1.1: can there be one paragraph describing the possibility of
attaching a NONCE to almost every error response?
moved around. Also added response names like the client processing.
- 9.1.2: exclusive-or of the source IP address and port from the
request -> exclusive-or of the source transport address and
magic cookie.
Another big oops.
- 9.1.2: before the FINGERPRINT paragraph, there should be a paragraph
that if the request contained MESSAGE-INTEGRITY, the response
must have MESSAGE-INTEGRITY generated to match 8.3.2
good catch - fixed.
- 9.1.2: FINGERPRINT paragraph : indication->response
fixed
- 9.1.2 last sentence: requests defined -> usages defined
actually I think "methods" is better
- end of 9.2: depends on the message type -> depends on the usage.
for consistency with earlier client discussion.
changed to method also
- 10 end of 1st par: requests and responses -> messages and
indications
actually 'all message types' is correct here
- 11 Figure 4: I think MESSAGE-INTEGRITY should be C for Response and
Error Response based on 8.3.2 requiring it if the request used it.
ok
- 11.1: should probably say IP addr is in network byte order
pl
- 11.3: used as the HMAC *key* in the
fixed
- 11.4: why "generally"?
FINGERPRINT. I clarified.
- 11.4 last sentence: shared secret->password to include both short
and long term?
I think its OK as written.
- 12.2.1/12.2.7: 12.2.1 "assumed" that the rendezvous gives the shared
secret, 12.2.7 says it MUST
- 12.2.2 last sentence: actual communications -> application data traffic.
This has all been removed since connectivity checks are entirely in ICE now.
- 12.4.1 is 9 minutes a MUST?
Yes. THe normative requirement is spelled out in the server processing
section.
- 12.4.9: the very end refers to REFLECTED-FROM. I know this is a
historical leftover, but I can't figure out what the sentence
is trying to say.
Its all historic. I removed the whole bit about using clientIP.
- 14.1: since this draft allows for and refers to extensions,
including stun-relay explicitly, I think it's better to say
something like STUN is intended to provide the functionality
necessary to describe how to connect two endpoints regardless of
the location of type of NATs in the topology.
Wow! Talk about a thorough read - I thought I was the only one that
would ever read this section. Fixed.
Thanks,
Jonathan R.
--
Jonathan D. Rosenberg, Ph.D. 600 Lanidex Plaza
Cisco Fellow Parsippany, NJ 07054-2711
Cisco Systems
jdrosen@xxxxxxxxx FAX: (973) 952-5050
http://www.jdrosen.net PHONE: (973) 952-5000
http://www.cisco.com