
Python — Extracting Domain Name From URLs Using Regular Expressions

2 min read · Feb 26, 2020

As Python developers, we often have to do a lot of data cleansing on a file before the rest of the business logic can run.

For example, you may have a raw text file of web-scraping data and need to read specific values from it, such as website URLs, and then apply regular-expression matching to pull out the domain names.

Extracting the domain name accurately can be quite tricky, mainly because the domain extension can have two parts (like .com.au or .co.uk) and the subdomain (the prefix) may or may not be present.

The hard part is knowing whether the name sits at the second level, the third level, or deeper.
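To make the difficulty concrete, here is a minimal sketch of the most obvious shortcut: keep only the last two dot-separated labels. The helper name naive_domain is hypothetical and exists only for this illustration.

```python
def naive_domain(host):
    """Keep only the last two dot-separated labels of a hostname."""
    labels = host.split(".")
    return ".".join(labels[-2:])


# Works when the extension has a single part
print(naive_domain("m.docs.google.com"))  # google.com

# Breaks when the extension has two parts: the real name ("mall") is lost
print(naive_domain("www.someisotericdomain.innersite.mall.co.uk"))  # co.uk
```

The second call returns only the extension, which is why a fixed "last two labels" rule is not enough.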

What is a Regular Expression and which module is used in Python?

A regular expression is a sequence of characters, written in a specialized pattern syntax, that is mainly used to find and replace patterns in a string or file.

The Python module re provides full support for Perl-like regular expressions in Python. The re module raises the exception re.error if an error occurs while compiling or using a regular expression.
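As a quick illustration of the re module, the sketch below finds and replaces host-like tokens in a string and then shows re.error being raised for a malformed pattern. The sample text and the pattern are assumptions chosen for the demo, not a complete hostname grammar.

```python
import re

# Illustrative sample text (an assumption for this demo)
text = "Visit www.example.info or m.google.com today"

# Two or more labels joined by dots; a rough pattern, not a full hostname grammar
HOST_PATTERN = r"[\w-]+(?:\.[\w-]+)+"

# re.findall returns every non-overlapping match of the pattern
hosts = re.findall(HOST_PATTERN, text)
print(hosts)  # ['www.example.info', 'm.google.com']

# re.sub replaces every match with the given string
masked = re.sub(HOST_PATTERN, "<host>", text)
print(masked)  # Visit <host> or <host> today

# A malformed pattern (unclosed group) raises re.error when it is compiled
try:
    re.compile(r"(unclosed")
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print("Invalid pattern:", exc)
```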

Example -

# Python program to extract domain names from a list of website URLs using regular expressions.

# Importing module required for regular expressions

import re

# List of website URLs
domainlist = ['m.google.com',
'm.docs.google.com',
'www.someisotericdomain.innersite.mall.co.uk',
'www.ouruniversity.department.mit.ac.us',
'www.somestrangeurl.shops.relevantdomain.net',
'www.example.info']

#print values in the list
print(domainlist)

['m.google.com', 'm.docs.google.com', 'www.someisotericdomain.innersite.mall.co.uk', 'www.ouruniversity.department.mit.ac.us', 'www.somestrangeurl.shops.relevantdomain.net', 'www.example.info']

# Loop over the list

# Get the domain for each entry

# A regex that catches every kind of domain would have to be enormous

# It returns the domain from each URL.
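One way the steps described in the comments above could be written is sketched below. It is not the original listing: the pattern, the SECOND_LEVEL list, and the helper name get_domain are assumptions made for illustration. The pattern covers only a handful of two-part extensions (a label such as co or ac followed by a two-letter country code) and falls back to a single-part extension otherwise.

```python
import re

# Same sample hostnames as in the list above
domainlist = ['m.google.com',
              'm.docs.google.com',
              'www.someisotericdomain.innersite.mall.co.uk',
              'www.ouruniversity.department.mit.ac.us',
              'www.somestrangeurl.shops.relevantdomain.net',
              'www.example.info']

# Assumption: a short, incomplete list of labels that can appear
# before a two-letter country code (as in .co.uk or .ac.us)
SECOND_LEVEL = "co|ac|com|org|net|gov|edu"

# One name label, then either a two-part or a one-part extension,
# anchored at the end of the hostname
DOMAIN_RE = re.compile(
    r"([\w-]+\.(?:(?:" + SECOND_LEVEL + r")\.[a-z]{2}|[a-z]{2,}))$"
)


def get_domain(host):
    """Return the domain with its extension, or None if nothing matches."""
    match = DOMAIN_RE.search(host.lower())
    return match.group(1) if match else None


# Loop over the list and print the domain for each entry
for host in domainlist:
    print(host, "->", get_domain(host))

# m.google.com -> google.com
# m.docs.google.com -> google.com
# www.someisotericdomain.innersite.mall.co.uk -> mall.co.uk
# www.ouruniversity.department.mit.ac.us -> mit.ac.us
# www.somestrangeurl.shops.relevantdomain.net -> relevantdomain.net
# www.example.info -> example.info
```

Because the list of two-part extensions is hard-coded and short, this sketch will misjudge extensions it does not know about. Covering every case would mean tracking the full Public Suffix List, which is why the regex "would have to be enormous".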


Written by Ryan Arjun
