Hi! I’m sorry, but there is no easy answer here.
Here is a document I created with a few approaches: https://public.getgrist.com/5uak5drY3QD3/Extract-Domain/m/fork
Getting the host from a URL is straightforward. I am using an import style that works unchanged in both Python 2 and Python 3, because of the upcoming Python 3 project:
from six.moves.urllib.parse import urlparse
To handle either valid URLs or plain hostnames, you can add:
urlparse($URL).hostname if "/" in $URL else $URL
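Outside of Grist, the same check can be tried as a small function. This is only a sketch: `extract_host` is a made-up name, its `url` argument stands in for `$URL`, and it imports from the Python 3 standard library rather than `six` so that it runs without extra packages.

```python
from urllib.parse import urlparse  # Python 3 stdlib equivalent of the six.moves import


def extract_host(url):
    """Return the hostname of a full URL, or the input itself if it contains no slash."""
    # Hypothetical helper: `url` plays the role of $URL in the formula above.
    return urlparse(url).hostname if "/" in url else url


print(extract_host("https://www.example.co.uk/some/page"))
print(extract_host("example.com"))
```

One thing to watch for: a value such as `www.example.com/page` (a path but no scheme) gives `None`, because `urlparse` treats the whole string as a path.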
To extract the last two dot-separated parts of a host, this can be used:
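As a rough sketch of that idea in plain Python (the name `last_two_parts` is made up for illustration, and in a Grist formula you would use `$Host` in place of the argument), you can split on dots and keep the last two pieces:

```python
def last_two_parts(host):
    """Keep only the last two dot-separated labels of a hostname."""
    # Hypothetical helper; `host` stands in for $Host.
    return ".".join(host.split(".")[-2:])


print(last_two_parts("www.example.com"))
print(last_two_parts("www.example.co.uk"))  # shows the weakness with multi-part TLDs
```

This is not necessarily the exact formula in the linked document, just one way to express the same thing.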
Another simple approach is in the “Skip One” column, which is useful if you normally expect to skip exactly one subdomain part and nothing more.
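A guess at what such a formula might look like, written as plain Python. The name `skip_one` is hypothetical, and the guard that leaves two-part hosts alone is my own assumption, so the column in the document may behave differently:

```python
def skip_one(host):
    """Drop the first dot-separated label, e.g. a leading 'www'."""
    parts = host.split(".")
    # Assumed guard: leave hosts with two or fewer labels untouched.
    return ".".join(parts[1:]) if len(parts) > 2 else host


print(skip_one("www.example.com"))
print(skip_one("www.example.co.uk"))
```

It copes with `co.uk` only because exactly one label is removed; deeper subdomains such as `a.b.example.com` are only partly trimmed.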
But if you are using the tldextract library, you probably care about the multi-part TLDs, like co.uk. There are surprisingly many of those, and there is no pattern. It’s impossible to replicate the full functionality of tldextract without storing that full list somewhere, but you can have a partial solution that will work for many common cases.
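To make the "store a list" idea concrete, here is a minimal sketch with a tiny hand-picked set of suffixes. The set contents and the name `registrable_domain` are assumptions for illustration only; this is not how tldextract itself is implemented, and the real list is far longer.

```python
# Hand-picked, deliberately incomplete sample of multi-part suffixes.
KNOWN_MULTI_PART_SUFFIXES = {"co.uk", "org.uk", "com.au", "co.jp"}


def registrable_domain(host):
    """Keep three labels if the host ends in a known two-part suffix, else two."""
    parts = host.lower().split(".")
    keep = 3 if ".".join(parts[-2:]) in KNOWN_MULTI_PART_SUFFIXES else 2
    return ".".join(parts[-keep:])


print(registrable_domain("www.example.co.uk"))
print(registrable_domain("www.example.com"))
```

Any suffix missing from the set falls back to the "last two parts" behavior, so the result is only as good as the list you maintain.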
The last column in the document is of the type you mention (regex):
re.sub(r'^.*?(\w+\.(\w+|(co|com|org)\.\w\w|\w\w\.us))$', r"\1", $Host)
It allows adding more special handling to the regular expression to capture other likely cases of multi-part TLDs. As you say, it’s imperfect and complicated, but hopefully useful if it gets the job done.
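If you want to try the expression outside Grist, a small test harness like the following may help. The pattern is copied unchanged from the formula above; the function name `strip_subdomains` and the sample hosts are made up for illustration.

```python
import re

# Same pattern as in the formula above, compiled once for reuse.
PATTERN = re.compile(r'^.*?(\w+\.(\w+|(co|com|org)\.\w\w|\w\w\.us))$')


def strip_subdomains(host):
    """Apply the regex to a hostname; `host` stands in for $Host."""
    return PATTERN.sub(r"\1", host)


for sample in ["www.example.com", "www.example.co.uk",
               "shop.example.com.au", "www.city.ca.us", "www.example.ac.uk"]:
    print(sample, "->", strip_subdomains(sample))
```

The last sample comes out as `ac.uk`, since `ac` is not among the listed prefixes, which is the kind of case you would handle by extending the alternation. Also note that `\w` does not match hyphens, so a host like `my-site.com` is cut down to `site.com`.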