Extracting root domain

DivMode · September 15, 2021, 9:04am

I use the tldextract package for Python to extract the root domain.

Is there an easy way to do this using only the standard Python libraries? I believe I'm not currently able to install my own Python libraries. I tried a couple of examples based on urlparse and regex, but they aren't perfect and are overly complicated.

import tldextract

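# Avoid naming the result "list", which would shadow the Python built-in.
extracted = tldextract.extract('http://blog.divmode.com/about')

domain_name = extracted.domain + '.' + extracted.suffix  # 'divmode.com'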

dmitry-grist · September 27, 2021, 4:09pm

Hi! I’m sorry, there is no easy answer here.

Here is a document I created with a few approaches: https://public.getgrist.com/5uak5drY3QD3/Extract-Domain/m/fork

Getting the host from a URL is straightforward (I am using an import style that works unchanged with Python 2 and Python 3, because of the upcoming Python 3 project):

from six.moves.urllib.parse import urlparse
urlparse($URL).hostname

To handle either valid URLs or plain hostnames, you can add:

urlparse($URL).hostname if "/" in $URL else $URL
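
For anyone trying this outside of a Grist formula, here is a minimal sketch of the same check as a plain Python 3 function. The name get_host is just an illustration (it is not part of the linked document), and it uses urllib.parse directly rather than six:

```python
from urllib.parse import urlparse

def get_host(url_or_host):
    """Return the hostname of a full URL, or the input itself for a bare hostname."""
    # Same heuristic as the formula above: a "/" means "treat it as a URL".
    if "/" in url_or_host:
        return urlparse(url_or_host).hostname
    return url_or_host

print(get_host("http://blog.divmode.com/about"))
print(get_host("blog.divmode.com"))
```

One caveat with this heuristic: a value like example.com/path, which has a slash but no scheme, is parsed as a path only, so hostname comes back as None.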

To extract the last two dot-separated parts of a host, this can be used:

".".join($Host.split(".")[-2:])

Another simplistic approach is in the “Skip One” column, which is useful if you normally expect to skip exactly one subdomain part and nothing more.

But if you are using the tldextract library, you probably care about the multi-part TLDs, like co.uk. There are surprisingly many of those, and there is no pattern. It’s impossible to replicate the full functionality of tldextract without storing that full list somewhere, but you can have a partial solution that will work for many common cases.
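
As a rough sketch of what "storing the list somewhere" could look like with only the standard library: keep a small set of the multi-part suffixes you expect to see, and fall back to the last two parts otherwise. The names MULTI_PART_SUFFIXES and root_domain, and the four sample suffixes, are my own placeholders and nowhere near complete:

```python
# Hypothetical, deliberately tiny sample; the real list of public suffixes is far longer.
MULTI_PART_SUFFIXES = {"co.uk", "org.uk", "com.au", "co.jp"}

def root_domain(host):
    """Return the registrable domain, honoring the known multi-part suffixes."""
    labels = host.lower().split(".")
    # Keep three labels if the last two form a known multi-part suffix, else two.
    keep = 3 if ".".join(labels[-2:]) in MULTI_PART_SUFFIXES else 2
    return ".".join(labels[-keep:])

print(root_domain("www.bbc.co.uk"))
print(root_domain("blog.divmode.com"))
```

Any suffix missing from the set silently falls back to the two-part behavior, so this only works as well as the list you maintain.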

The last column in the document is of the type you mention (regex):

re.sub(r'^.*?(\w+\.(\w+|(co|com|org)\.\w\w|\w\w\.us))$', r"\1", $Host)

It allows adding more special handling to the regular expression to capture other likely cases of multi-part TLDs. As you say, it’s imperfect and complicated, but hopefully useful if it gets the job done.
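
To try the regex outside of Grist, a small wrapper like the one below should do. The function name root_domain_regex is only for illustration; the pattern itself is copied unchanged from the formula above:

```python
import re

# Pattern copied from the formula above: the last two labels, or three labels for
# co/com/org followed by a two-letter country code, or for xx.us.
ROOT_PATTERN = re.compile(r'^.*?(\w+\.(\w+|(co|com|org)\.\w\w|\w\w\.us))$')

def root_domain_regex(host):
    """Strip leading subdomains from host using the partial multi-part TLD regex."""
    return ROOT_PATTERN.sub(r"\1", host)

for host in ["blog.divmode.com", "www.bbc.co.uk", "www.example.gov.uk", "my-site.com"]:
    print(host, "->", root_domain_regex(host))
```

The last two hosts show the limits: gov.uk is not covered by the pattern, so only gov.uk is returned, and because \w does not match hyphens, my-site.com is cut down to site.com. Both could be handled by extending the expression (for example, allowing "-" in the first label), at the cost of making it even longer.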