8.2. Finding URLs Within Full Text
Problem
You want to find URLs in a larger body of text. URLs may or may not be enclosed in punctuation, such as parentheses, that are not part of the URL.
Solution
URL without spaces:
\b(https?|ftp|file)://\S+
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
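As a minimal sketch of how this first pattern behaves, here it is in Python with the case-insensitive option set. The helper name find_urls_no_spaces is hypothetical and used only for illustration. The sample sentence is the one examined in the Discussion of this recipe.

```python
import re

# First pattern from the recipe: a scheme, "://", then any run of non-whitespace.
URL_NO_SPACES = re.compile(r"\b(https?|ftp|file)://\S+", re.IGNORECASE)

def find_urls_no_spaces(text):
    """Return every match of the permissive pattern (hypothetical helper)."""
    return [m.group(0) for m in URL_NO_SPACES.finditer(text)]

sample = "Visit http://www.somesite.com/page, where you will find more information."
matches = find_urls_no_spaces(sample)
print(matches)  # ['http://www.somesite.com/page,']
```

Because \S+ accepts any non-whitespace character, the comma that follows the URL becomes part of the match.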
URL without spaces or final punctuation:
\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
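The following Python sketch shows what the second pattern does with punctuation that trails a URL. The helper name find_urls_trimmed and the example.com address are assumptions made for illustration. They do not come from the recipe.

```python
import re

# Second pattern from the recipe, written as one unbroken regex.
# The first character class allows punctuation inside the URL; the final
# class omits ? ! : , . ; so the match cannot end on one of them.
URL_TRIMMED = re.compile(
    r"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

def find_urls_trimmed(text):
    """Return every match of the trimmed pattern (hypothetical helper)."""
    return [m.group(0) for m in URL_TRIMMED.finditer(text)]

sample = "Visit http://www.somesite.com/page, where you will find more information."
in_parens = "(see http://example.com/a?b=1)."  # made-up URL for illustration

print(find_urls_trimmed(sample))     # ['http://www.somesite.com/page']
print(find_urls_trimmed(in_parens))  # ['http://example.com/a?b=1']
```

The greedy first class initially swallows the trailing comma, and the regex then backtracks until the last character is one that the final class permits.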
URL without spaces or final punctuation. URLs that start with the www or ftp subdomain can omit the scheme:
\b((https?|ftp|file)://|(www|ftp)\.)[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
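To illustrate the optional scheme, here is a hedged Python sketch of the third pattern. The helper name find_urls_optional_scheme and the example.com and example.org hosts are hypothetical, chosen only to exercise the www and ftp alternatives.

```python
import re

# Third pattern from the recipe: either a scheme followed by "://",
# or a leading "www." or "ftp." subdomain with no scheme at all.
URL_OPTIONAL_SCHEME = re.compile(
    r"\b((https?|ftp|file)://|(www|ftp)\.)"
    r"[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

def find_urls_optional_scheme(text):
    """Return every match, with or without a scheme (hypothetical helper)."""
    return [m.group(0) for m in URL_OPTIONAL_SCHEME.finditer(text)]

text = "See http://www.somesite.com/page, or www.example.com/docs. Or ftp.example.org!"
found = find_urls_optional_scheme(text)
print(found)
# ['http://www.somesite.com/page', 'www.example.com/docs', 'ftp.example.org']
```

The www inside the first URL is not reported a second time, because finditer resumes scanning only after the end of the previous match.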
Discussion
Given the text:
Visit http://www.somesite.com/page, where you will find more information.
what is the URL?
Before you say http://www.somesite.com/page, think about this: punctuation and spaces are valid characters in URLs. Though RFC 3986 (see Recipe 8.7) does not allow literal spaces in URLs, all major browsers accept URLs with literal spaces just fine. Some WYSIWYG web authoring tools even make it easy for the user to put spaces in file and folder names, and include those spaces literally in links to those files.
That means that if we use a regular expression that allows all valid URLs, it will find this URL in the preceding text:
http://www.somesite.com/page, where ...
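A small Python sketch makes the runaway match concrete. The pattern below is not one of this recipe's solutions. It is a hypothetical variant of the second solution with a literal space added to the first character class, used here only to stand in for a regex that tolerates spaces.

```python
import re

# Hypothetical variant (an assumption, not a recipe solution): the second
# solution's pattern with a space added to the first character class.
URL_WITH_SPACES = re.compile(
    r"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.; ]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

sample = "Visit http://www.somesite.com/page, where you will find more information."
runaway = URL_WITH_SPACES.search(sample).group(0)
print(runaway)
# http://www.somesite.com/page, where you will find more information
```

Once spaces are allowed, nothing in the pattern can tell where the URL ends and the sentence continues, so the match runs on to the last word.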