Enhancement request: Adding some error information for DNS requests like RQ #19

Open
borjam opened this issue Nov 23, 2022 · 3 comments

Comments


borjam commented Nov 23, 2022

dnstap is really comprehensive as a DNS server monitoring solution.

Thanks to dnstap, it is simple to obtain, for instance, response times for queries. Because dnstap includes the query timestamp in response messages, computing the response time does not require keeping track of individual queries and responses; it relies on the context information already stored by the DNS server.
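As a minimal sketch of that calculation: the field names below (`query_time_sec`, `query_time_nsec`, `response_time_sec`, `response_time_nsec`) follow the dnstap `Message` schema, while the `ResponseMessage` container and the `response_time_ms` helper are hypothetical stand-ins for whatever decoder is actually in use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResponseMessage:
    """Hypothetical stand-in for a decoded dnstap response Message."""
    query_time_sec: Optional[int]   # may be absent if the server lacks query state
    query_time_nsec: Optional[int]
    response_time_sec: int
    response_time_nsec: int


def response_time_ms(msg: ResponseMessage) -> Optional[float]:
    """Return the response time in milliseconds, or None without a query timestamp."""
    if msg.query_time_sec is None:
        return None
    # Work in integer nanoseconds to avoid floating-point drift on large epochs.
    query_ns = msg.query_time_sec * 1_000_000_000 + (msg.query_time_nsec or 0)
    response_ns = msg.response_time_sec * 1_000_000_000 + msg.response_time_nsec
    return (response_ns - query_ns) / 1_000_000
```

No per-query state is needed on the consumer side: a single response message carries both timestamps.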

However, there is a situation in which dnstap (in my opinion) falls short: timeouts due to packet loss or unresponsive servers are not reported through dnstap.

This means there are two ways to obtain this data:

  • Lame server logging, which varies a lot among implementations. Even within BIND 9, the 9.16 and 9.18 branches behave very differently when logging timed-out or unreachable lame servers. I don't know about other DNS implementations; I think Unbound doesn't log that kind of error, but I haven't checked seriously.
  • Dnstap in its present state. A separate program could keep track of RQ and RR dnstap messages and generate a new type of event (timeout) for "missing" RR messages. Apart from the added complexity, some information would be lost: for example, it wouldn't be possible to know whether the failure was a timeout, a network unreachable error, or something else.
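To make the second option concrete, here is a rough sketch of such an external tracker. Everything in it is an assumption for illustration: the `TimeoutTracker` class, the correlation key (server address, port, DNS message ID, query name), and the fixed timeout value, which is only a guess at what the resolver does internally.

```python
from typing import Dict, List, Tuple

# Hypothetical correlation key: (server address, server port, DNS message ID, qname)
Key = Tuple[str, int, int, str]


class TimeoutTracker:
    """Pairs RQ events with RR events and reports queries that never got a reply."""

    def __init__(self, timeout_s: float = 5.0) -> None:
        self.timeout_s = timeout_s            # assumed value; the resolver's real timer is unknown
        self.pending: Dict[Key, float] = {}   # key -> time the RQ was seen (seconds)

    def on_resolver_query(self, key: Key, sent_at: float) -> None:
        self.pending[key] = sent_at

    def on_resolver_response(self, key: Key) -> None:
        # A matching RR clears the pending entry; unmatched RRs are ignored.
        self.pending.pop(key, None)

    def expire(self, now: float) -> List[Key]:
        """Return (and forget) every query older than the assumed timeout."""
        expired = [k for k, t in self.pending.items() if now - t >= self.timeout_s]
        for k in expired:
            del self.pending[k]
        return expired
```

The sketch also shows the limitation described above: the tracker can only report that no RR arrived. It cannot tell a timeout from an unreachable network, and its timeout value may not match the one the DNS server actually applied.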

The second option doesn't look good. Moreover, dnstap seems to be designed from the ground up to avoid a situation like that: the DNS server software is aware of the query state, so response messages include the query timestamp when available.

Although it would break one aspect of dnstap, in which it tries to behave as closely as possible to a packet capture on steroids, that kind of out-of-band message would (in my opinion) greatly improve it.

At least in the situation I am describing (detecting certain errors when trying to query another DNS server), I guess the performance impact would be negligible, and all of the state information needed is already in place.

What do you think?

borjam changed the title from "Enhancement request: Adding timed out messages for resolver requests" to "Enhancement request: Adding some error information for DNS requests like RQ" on Nov 23, 2022

borjam commented Nov 23, 2022

This would be equally useful for FQ/FR messages of course.


borjam commented Nov 24, 2022

A possible way (which I think would be consistent) would be to add a second message type, so that "Message" (the only type supported for now) still reflects actual queries and responses, while the second one ("Error"?) could carry the failed query and the error information (unreachable destination, timeout, etc.).
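Purely as an illustration of what such a second type might carry, here is a sketch modelled in Python rather than in the protobuf schema. None of these names exist in dnstap today: `ErrorKind`, `ErrorEvent`, and every field are hypothetical, chosen to mirror the query-side fields that response messages already include.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorKind(Enum):
    """Hypothetical error categories taken from the examples in this thread."""
    TIMEOUT = 1
    DESTINATION_UNREACHABLE = 2
    OTHER = 3


@dataclass(frozen=True)
class ErrorEvent:
    """Hypothetical out-of-band record for a query that never got a response."""
    kind: ErrorKind
    query_message: bytes      # wire-format query that failed
    query_address: str        # server the query was sent to
    query_port: int
    query_time_sec: int       # when the query was sent
    query_time_nsec: int
    error_time_sec: int       # when the server gave up on it
    error_time_nsec: int

    def elapsed_ms(self) -> float:
        """Time between sending the query and declaring the error, in ms."""
        start = self.query_time_sec * 1_000_000_000 + self.query_time_nsec
        end = self.error_time_sec * 1_000_000_000 + self.error_time_nsec
        return (end - start) / 1_000_000
```

Keeping the failed query and its timestamp in the same record would let a consumer process errors the same way it processes responses, without correlating separate events.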

Member

edmonds commented Nov 30, 2022

> Although it would break one aspect of dnstap, in which it tries to behave as closely as possible to a packet capture on steroids, that kind of out-of-band message would (in my opinion) greatly improve it.

This is not correct. dnstap is an instrumentation format for representing events that occur inside DNS software. It is not a "packet capture on steroids" format. E.g., dnstap is oblivious to the packetized representation (TCP segmentation, IP fragmentation, TLS encryption) of a wire-format DNS message. Similarly, packet capture representations of DNS server traffic cannot capture metadata that dnstap can export (e.g., the Message.query_zone metadata field, for recursive DNS implementations.) In general it's not possible for a stream of dnstap messages to be converted into a packet capture format or for a stream of packets in an existing packet capture format to be converted into a stream of dnstap messages.

> However, there is a situation in which dnstap (in my opinion) falls short: timeouts due to packet loss or unresponsive servers are not reported through dnstap.
>
> This means there are two ways to obtain this data:
>
>   • Lame server logging, which varies a lot among implementations. Even within BIND 9, the 9.16 and 9.18 branches behave very differently when logging timed-out or unreachable lame servers. I don't know about other DNS implementations; I think Unbound doesn't log that kind of error, but I haven't checked seriously.
>   • Dnstap in its present state. A separate program could keep track of RQ and RR dnstap messages and generate a new type of event (timeout) for "missing" RR messages. Apart from the added complexity, some information would be lost: for example, it wouldn't be possible to know whether the failure was a timeout, a network unreachable error, or something else.

A timeout is intrinsically an event that occurs inside DNS software, so it would be plausible to design new protobuf message type(s) for the dnstap protobuf schema and instrument DNS servers to support the new message type(s).

Unbound's timeout algorithm is described here: https://www.nlnetlabs.nl/documentation/unbound/info-timeout/.

> The second option doesn't look good. Moreover, dnstap seems to be designed from the ground up to avoid a situation like that: the DNS server software is aware of the query state, so response messages include the query timestamp when available.
>
> At least in the situation I am describing (detecting certain errors when trying to query another DNS server), I guess the performance impact would be negligible, and all of the state information needed is already in place.
>
> What do you think?

I think there are a lot of possible "error" events that can occur in a recursive DNS server beyond just network timeouts. For instance, RFC 8914 (Extended DNS Errors) specifies an in-band way of encoding several dozen different error codes in a response to a client that supports the EDE option. These will get logged into dnstap incidentally by a recursive DNS server when it responds to clients that set the EDE option. But maybe it makes sense to design a dnstap schema for encapsulating an out-of-band EDE-like payload so that the server operator can log the occurrences of these kinds of errors.
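For reference, the in-band EDE payload is small: per RFC 8914, the option data is a 16-bit INFO-CODE followed by optional UTF-8 EXTRA-TEXT. The sketch below encodes and decodes that payload. The helper names `build_ede` and `parse_ede` are made up for this example, and the code table is only a small subset of the registry. How such a payload would be wrapped in a dnstap schema is left open here, as it is in the discussion.

```python
import struct
from typing import Tuple

EDE_OPTION_CODE = 15  # EDNS option code assigned to Extended DNS Errors

# Small subset of the RFC 8914 INFO-CODE registry, for illustration only.
EDE_INFO_CODES = {
    0: "Other",
    3: "Stale Answer",
    6: "DNSSEC Bogus",
    22: "No Reachable Authority",
    23: "Network Error",
}


def build_ede(info_code: int, extra_text: str = "") -> bytes:
    """Encode EDE option data: 16-bit INFO-CODE, then UTF-8 EXTRA-TEXT."""
    return struct.pack("!H", info_code) + extra_text.encode("utf-8")


def parse_ede(option_data: bytes) -> Tuple[int, str, str]:
    """Decode EDE option data into (info_code, code_name, extra_text)."""
    if len(option_data) < 2:
        raise ValueError("EDE option data must contain a 16-bit INFO-CODE")
    (info_code,) = struct.unpack("!H", option_data[:2])
    extra_text = option_data[2:].decode("utf-8", errors="replace")
    return info_code, EDE_INFO_CODES.get(info_code, "Unknown"), extra_text
```

Codes such as 22 (No Reachable Authority) and 23 (Network Error) are close to the failures this issue is about, which is one reason an EDE-like payload looks like a plausible starting point.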
