January 9th, 2022

The WHOIS protocol is one of the older internet protocols around. It's infuriatingly simple and by and large considered obsolete; the data it provides is unpredictable, unreliable, and incomplete; and it is, of course, still one of the cornerstones of internet operations. In other words, it's the kind of thing I like to waste my time on trying to understand.

Originally set up in the 1970s at the Stanford Research Institute Network Information Center (aka SRI-NIC) by the mother of the DNS and overall ARPANET boss Elizabeth J. Feinler, WHOIS was first described in RFC812 (1982). Based on the FINGER protocol, it was as dead simple as you could imagine:

    Connect to the service host (SRI-NIC)

       TCP: service port 43 decimal
       NCP: ICP to socket 43 decimal, establishing two 8-bit connections

    Send a single "command line", ending with <CRLF>.

    Receive information in response to the command line.

Yep, that was it. And that's still the full protocol specification (now RFC3912 (2004)). Here, give it a try:

    $ telnet whois.iana.org 43
    Trying 2620:0:2d0:200::59...
    Connected to ianawhois.vip.icann.org.
    Escape character is '^]'.
    org
    % IANA WHOIS server
    % for more information on IANA, visit http://www.iana.org
    % This query returned 1 object

    domain: ORG
    [...]

Congratulations - you just spoke WHOIS!

The data you get back is intentionally not structured and is designed to be human-, not machine-readable (more on that a bit below). It was originally intended to provide contact information including "mailing address, telephone number, and network mailbox" "for ARPANET users" like so:

    Command line: dyer
    Response:
       Dyer, David A. (DAD2)    DDYER@USC-ISIB        (213) 822-1511
       Dyer, Fred S. (FSD)      Dyer@RADC-MULTICS     (315) 330-7275
       Dyer, Mary K. (MARY)     DYER@SRI-NIC          (415) 859-4775
       Dyer, William R. (WRD)   WRDyer@RADC-MULTICS   (315) 330-7791

    Command line: mary
    Response:
       Dyer, Mary K. (MARY)     DYER@SRI-NIC
          SRI International
          Network Information Center
          Telecommunications Sciences Center
          333 Ravenswood Avenue
          Menlo Park, California 94025
          Phone: (415) 859-4775

And you thought the DNS was the phonebook of the internet...

How to find the responsible WHOIS server

When the internet grew too large for SRI-NIC to continue functioning as the global phonebook, and eventually with the transfer of the operation of the DNS root to ICANN, WHOIS also became decentralized. Information about the various (and increasing number of) TLDs was provided logically by the Regional Internet Registries (RIRs), registries, and registrars. Some of them run a so-called "thick" server, which provides all the information; others are "thin" servers, only providing the information of the WHOIS server that does have the full information. Different TLDs, for example, may operate in either mode, but the protocol does not provide any means to differentiate the two.

In other words: if you wanted to find out information about a domain, you'd have to know who the responsible registry is to ask them. How do you know what WHOIS server to query for a given domain? Well, you just gotta know. There's no standardized way. Some domains use SRV DNS records as suggested in this internet draft:

    $ host -t srv _nicname._tcp.co.uk
    _nicname._tcp.co.uk has SRV record 0 0 43 whois.nic.uk.
    $ host -t srv _nicname._tcp.arab
    _nicname._tcp.arab has SRV record 10 10 0 your-dns-needs-immediate-attention.arab.
    $ host -t srv _nicname._tcp.cpa
    _nicname._tcp.cpa has SRV record 10 10 0 your-dns-needs-immediate-attention.cpa.
    $ host -t srv _nicname._tcp.music
    _nicname._tcp.music has SRV record 10 10 0 your-dns-needs-immediate-attention.music.
    $ host -t srv _nicname._tcp.xn--fiqs8s
    _nicname._tcp.中国 is an alias for wildcard.cnnic.cn.
    $ host -t srv _nicname._tcp.xn--fiqz9s
    _nicname._tcp.中國 is an alias for wildcard.cnnic.cn.
    $ host -t srv _nicname._tcp.xn--mxtq1m
    _nicname._tcp.政府 has SRV record 10 10 0 your-dns-needs-immediate-attention.政府
    $ host -t srv _nicname._tcp.xn--ngbrx
    _nicname._tcp.عرب has SRV record 10 10 0 your-dns-needs-immediate-attention.عرب
    $

...but that seems to function primarily as an indicator of a TLD compromise: out of 1489 TLDs, only nic.uk has a valid entry.

Instead, some TLDs use <tld>.whois-servers.net, and the "new" TLDs after 2003 are supposed to have whois.nic.<tld>; ccTLDs pretty much all do their own thing, why not. Hence, your whois(1) client likely contains some optimistic logic and a number of hardcoded RIR WHOIS servers like this:

    #define ANICHOST        "whois.arin.net"
    #define BNICHOST        "whois.registro.br"
    #define CNICHOST        "whois.corenic.net"
    #define DNICHOST        "whois.nic.mil"
    #define FNICHOST        "whois.afrinic.net"
    #define GNICHOST        "whois.nic.gov"
    #define IANAHOST        "whois.iana.org"
    #define INICHOST        "whois.networksolutions.com"
    #define LNICHOST        "whois.lacnic.net"
    #define MNICHOST        "whois.ra.net"
    #define NICHOST         "whois.crsnic.net"
    #define PDBHOST         "whois.peeringdb.com"
    #define PNICHOST        "whois.apnic.net"
    #define QNICHOST_TAIL   ".whois-servers.net"
    #define RNICHOST        "whois.ripe.net"
    #define RUNICHOST       "whois.ripn.net"
    [...]
    /*
     * If no country is specified determine the top level domain from the query
     * If the TLD is a number, query ARIN, otherwise, use TLD.whois-server.net.
     * If the domain does not contain '.', check to see if it is an NSI handle
     * (starts with '!') or a CORE handle (COCO-[0-9]+ or COHO-[0-9]+) or an
     * ASN (starts with AS) or IPv6 address (contains ':'). Fall back to
     * NICHOST for the non-handle and non-IPv6 case.
     */

Otherwise, if you don't know the WHOIS server to query, you can try your luck asking IANA, which runs a "thick" server for all TLDs.
It should return to you the referral to the responsible WHOIS server, which you can then ask for who might be responsible for the final domain you care about:

    $ echo netmeister.org | nc whois.iana.org 43 | grep refer
    refer: whois.pir.org
    $ echo netmeister.org | nc whois.pir.org 43 | grep refer
    $ echo netmeister.org | nc whois.pir.org 43 | grep "Registrar WHOIS Server"
    Registrar WHOIS Server: whois.gandi.net
    $ echo netmeister.org | nc whois.gandi.net 43 | grep Creation
    Creation Date: 2000-04-24T02:15:22Z
    $

Notice something? When we ask IANA, we ask for "refer", but when we ask PIR, we need to ask for "Registrar WHOIS Server". This is because the WHOIS protocol does not specify the output format of the data, nor what data should be provided. At all. It's all free form, unstructured ASCII text -- if you're lucky, that is. (More on that (again) a bit below.)

Data Privacy

But what data would you expect to be found in WHOIS? Since the early days, ICANN has had a requirement for registries and registrars to provide unrestricted and public access to accurate and complete WHOIS information, including registrant, technical, billing, and administrative contact information (see ICANN Policies). This includes the actual postal address, phone numbers, and email addresses of the various contact persons or departments (see above re "phonebook"). All of which is of course routinely abused by all sorts of people, including scammers and phishers, and mined for general OSINT.

On the other hand, Law Enforcement really wants this information to be readily available, and as a geek with at least half a dozen random domains registered, you are likely familiar with the legal requirement to keep this information up to date.

Quite obviously this poses a dilemma: the information is required by ICANN to be openly provided, but for a variety of reasons and privacy concerns, you don't want your phone number and address out there on the internet.
But more than just a cosmetic concern, the ICANN requirement now does indeed conflict with modern privacy laws, such as the EU's GDPR, meaning all domains registered by European registries are in violation of either GDPR or ICANN's requirement. Fun! (ICANN promised not to take action against violators, and registries/registrars nowadays provide redacted information to the public but promise to provide detailed information upon "legitimate requests".)

Data Format

As I noted above, the data provided via WHOIS is completely unstructured and undefined. It is intended for human consumption, and the service operator is free to decide how to display the information. Most WHOIS servers use a simple "key: value" format, but that's far from universal. Similarly, different servers use different methods to e.g., show that certain pieces of information logically belong together. For example, consider the information returned by the different WHOIS servers involved in a simple lookup of this website:

    $ whois netmeister.org
    % IANA WHOIS server
    % for more information on IANA, visit
    % http://www.iana.org
    % This query returned 1 object

    refer: whois.pir.org

    domain: ORG

    organisation: Public Interest Registry (PIR)
    address: 11911 Freedom Drive 10th Floor,
    address: Suite 1000
    address: Reston, VA 20190
    address: United States

    contact: administrative
    name: Director of Operations, Compliance and Customer Support
    organisation: Public Interest Registry (PIR)
    address: 11911 Freedom Drive 10th Floor,
    address: Suite 1000
    address: Reston, VA 20190
    address: United States
    phone: +1 703 889 5778
    fax-no: +1 703 889 5779
    e-mail: ops@pir.org
    [...]

    # whois.pir.org

    Domain Name: NETMEISTER.ORG
    Registry Domain ID: D25516943-LROR
    Registrar WHOIS Server: whois.gandi.net
    Registrar URL: http://www.gandi.net
    Updated Date: 2021-02-20T17:59:09Z
    Creation Date: 2000-04-24T02:15:22Z
    [...]

    # whois.gandi.net

    Domain Name: netmeister.org
    [...]
    Registry Registrant ID: REDACTED FOR PRIVACY
    Registrant Name: REDACTED FOR PRIVACY
    [...]
    Registry Admin ID: REDACTED FOR PRIVACY
    Admin Name: REDACTED FOR PRIVACY
    [...]
    Registry Tech ID: REDACTED FOR PRIVACY
    Tech Name: REDACTED FOR PRIVACY
    [...]
    >>>Last update of WHOIS database: 2022-01-09T00:16:58Z <<<

Ok, so far, so good. Different grouping, but still, reasonably easy to parse. Now compare this to the following other queries returning results from various WHOIS servers:
Given how useful the information in WHOIS can be, it's no surprise that there are many businesses offering proprietary services to monetize the munging of the public information into a data format that's easy to process in an automated fashion, such as in XML or JSON. As you can tell from the above examples, it's fairly obvious how the information belongs together for a human: Humans are really, really good at identifying patterns visually, and you can all look at the output and immediately see what data represents what information, but trying to convince a computer to understand all these different formats is a major PITA and exactly what these services build their profit model on. Paying for an online service to access public data is a bit annoying, so I wrote a tool to JSONify WHOIS data: jswhois(1). This tool will attempt to turn the unstructured, human-readable output above into structured JSON as shown below:
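To give a rough idea of what such a conversion involves, here is a minimal Python sketch that handles only the simple "key: value" case. To be clear, this is not jswhois(1) and not its output: the function name parse_key_value and the regular expression are my own assumptions for illustration, and the sketch ignores grouping (such as IANA's contact blocks) as well as every server that doesn't use "key: value" lines at all.

```python
import json
import re

# Hypothetical helper for illustration only; not how jswhois(1) works.
# Matches "key: value" lines and skips lines starting with %, # or >,
# which the servers shown earlier use for comments and banners.
KEY_VALUE = re.compile(r"^\s*([^:%#>]+?):\s+(.*\S)\s*$")


def parse_key_value(text):
    """Collect 'key: value' lines into a dict; repeated keys become lists."""
    record = {}
    for line in text.splitlines():
        match = KEY_VALUE.match(line)
        if not match:
            continue  # comment, blank line, or free-form text
        key, value = match.group(1).strip(), match.group(2)
        if key in record:
            # Same key seen again (e.g. several "address:" lines).
            if not isinstance(record[key], list):
                record[key] = [record[key]]
            record[key].append(value)
        else:
            record[key] = value
    return record


if __name__ == "__main__":
    sample = (
        "% IANA WHOIS server\n"
        "refer: whois.pir.org\n"
        "address: Suite 1000\n"
        "address: Reston, VA 20190\n"
    )
    print(json.dumps(parse_key_value(sample), indent=2))
```

Even this trivial version has to make choices (what counts as a comment, what to do with repeated keys), and each server's format forces more of them.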
This is tedious, sure, but what's even more annoying is that it still is only of limited usefulness: aside from the lack of a data format, there is also no standard specification of what data is to be provided, and for the data that is required at least by ICANN, there is no requirement or specification of how that data is to be named. That is, if you want to use jswhois(1) to return to you the email address of the administrative contact of the domain in question, then you still have to know what the fields returned by the registrar's WHOIS server are named.

Commercial services may attempt to reformat or rename fields so that you have consistent keys to extract, but will that work for all domains? How many different WHOIS formats are there?

Registrars and Registries

Looking at a subset of TLDs from my previous adventure, I found a total of 1021 distinct WHOIS servers for 1489 TLDs. Here's the top ten breakdown of which WHOIS servers are responsible for the largest number of TLDs:

    244 whois.iana.org
     67 whois.afilias-srs.net
     46 whois.nic.google
     24 whois.uniregistry.net
     16 whois.registry.in
     14 whois.nic.gmo
      8 whois.gtld.knet.cn
      7 whois.teleinfo.cn
      6 whois.gtlds.nic.br
      5 whois.publicinterestregistry.net

IANA, Afilias, and Uniregistry not surprisingly manage the largest number of TLDs, and as you may remember from the new-TLD-landrush, Google had applied for over 100 TLDs and today runs 46 TLDs. (The largest number of TLDs registered by a single company goes to Donuts Inc. with 248, but they run a separate WHOIS server for each of those TLDs at whois.nic.<tld>.)

But that's only TLDs. There are over 2500 registrars accredited by ICANN, of which e.g., GoDaddy, currently the largest with over 72 million (!) domains, is just one. In theory, for each of the millions of second-level domains, there might be a different WHOIS server responsible, each with its own human-readable output format.
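Finding out which server is responsible means walking the referral chain, the way the netmeister.org lookup earlier bounced from IANA to PIR to Gandi. The following is a minimal Python sketch of that walk, not the logic of any particular whois(1) client. The function names are made up for illustration, and the two referral field names ("refer" and "Registrar WHOIS Server") are simply the ones seen in that one lookup; other servers may well use different names or none at all.

```python
import re
import socket

# Assumption: only the two referral fields seen in the netmeister.org lookup.
# This is not a standard; other servers name this field differently.
REFERRAL = re.compile(
    r"^[ \t]*(?:refer|Registrar WHOIS Server):[ \t]*(\S+)[ \t\r]*$",
    re.IGNORECASE | re.MULTILINE,
)


def whois_query(server, query, timeout=10):
    """Speak WHOIS: connect to TCP port 43, send one line ending in CRLF,
    then read until the server closes the connection."""
    with socket.create_connection((server, 43), timeout=timeout) as sock:
        sock.sendall(query.encode("utf-8") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    # Servers may answer in other charsets; replace what we cannot decode.
    return b"".join(chunks).decode("utf-8", errors="replace")


def find_referral(response):
    """Return the first referred WHOIS server in a response, or None."""
    match = REFERRAL.search(response)
    return match.group(1).lower() if match else None


def follow_referrals(query, start="whois.iana.org", lookup=whois_query, max_hops=5):
    """Ask each server in turn until no new referral shows up.
    Returns the list of servers asked and the last response."""
    seen = []
    server = start
    response = ""
    while server and server not in seen and len(seen) < max_hops:
        seen.append(server)
        response = lookup(server, query)
        server = find_referral(response)
    return seen, response
```

Calling follow_referrals("netmeister.org") performs real network queries against port 43. The lookup parameter exists only so that the chain-walking logic can be exercised with canned responses instead of live servers.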
Data in WHOIS

The data found in WHOIS varies from registry to registry, not only in structure (as shown above), but of course also in content. Some include nameserver IP addresses, some don't. Some include DNSSEC information, others don't. I even found an (expired) x509 cert in the WHOIS data for 2001:dcd::/32.

If you search for IP addresses or CIDRs, you get back rather different data than if you search for domain names. APNIC, RIPE, and AFRINIC, for example, even give you some routing and geolocation information:

    $ jswhois 2001:dd8:9:2::101:61 | jq
    {
      "query": "2001:dd8:9:2::101:61",
      "whois.apnic.net": {
        "inet6num": {
          "geoloc": "-27.473058 153.014208",
          "inet6num": "2001:dd8:8::/45",
          [...]
        }
        "route6": {
          "country": "AU",
          "descr": "APNIC Network",
          "last-modified": "2018-11-20T03:36:54Z",
          "mnt-by": "MAINT-APNIC-IS-AP",
          "origin": "AS4608",
          "route6": "2001:dd8:9::/48",
          "source": "APNIC"
        }
    [...]

Given the loose specification, you can use the WHOIS protocol and server for just about any data. Team Cymru, for example, lets you look up AS numbers for the given IP addresses using WHOIS:

    $ whois -h whois.cymru.com 2001:470:30:84:e276:63ff:fe72:3900
    AS   | IP                                 | AS Name
    2033 | 2001:470:30:84:e276:63ff:fe72:3900 | PANIX, US

And as you've no doubt noticed, some international WHOIS servers may return data to you in non-ASCII charsets, such as e.g., whois.kr, or whois.jprs.jp. How well do the various WHOIS API services handle what effectively amounts to random data that may be returned? I wonder...
    $ whois -h whois.netmeister.org log4j
     ___________________________________________________
    < ${jndi:ldap://www.netmeister.org/blog/whois.html} >
     ---------------------------------------------------
            \   ^___^
             \  (ooo)\_______
                (___)\       )\/\
                    ||----w |
                    ||     ||

Old and busted...

Since the data in WHOIS is unpredictable (who knows what data is returned to you and what the format might be), unreliable (who knows if the data you're looking for, if it is present at all, is up to date), difficult to discover (bouncing from IANA along unpredictable, unreliable referral entries or betting on a few hard-coded servers), possibly available via different mechanisms (besides the standard TCP port 43, several WHOIS servers provide an HTTP API endpoint), and often obscured or redacted (e.g., due to GDPR, but several WHOIS servers also require registration before either TCP port 43 or API access is granted)... why haven't we replaced it with Something Better(tm)?

There were some attempts to overhaul WHOIS, like the "Referral Whois" protocol (RWhois, RFC2167) or the now obsolete "WHOIS++", but it seems like one of those things everybody depends on, so changing it isn't going to be easy.

ICANN decided years ago to replace WHOIS, with work on a successor dating back to 2012, and the "Registration Data Access Protocol" (RDAP, RFC9082) certainly seems like a much better alternative. RDAP is RESTful and standardized based on an analysis by the IETF of the TLD WHOIS server responses; since 2019, ICANN requires registrars and registries to implement an RDAP service.

Fully replacing WHOIS does, however, not yet seem to be on the horizon, and we're still relying on what started out as perhaps the simplest possible protocol intended for human consumption. Sometimes the internet moves really slowly, and all I can hope is that nobody comes along and tries to put it on the blockchain...