You can use the following monster regex:
\b((?:https?://)?(?:(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])))(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?(?:/[\w\.-]*)*/?)\b
Demo: regex101
This regex accepts URLs in the following formats:
INPUT:
add1 http://mit.edu.com abc
add2 http://facebook.jp.com.2. abc
add3 www.google.be. uvw
add4 http://www.google.be. 123
add5 www.website.gov.us test2
Hey bob on www.test.com.
another test with ipv4 http://192.168.1.1/test.jpg. toto2
website with different port number www.test.com:8080/test.jpg not port 80
www.website.gov.us/login.html
test with ipv4 192.168.1.1/test.jpg.
search at google.co.jp/maps.
test with ipv6 2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg.
OUTPUT:
http://mit.edu.com
http://facebook.jp.com
www.google.be
http://www.google.be
www.website.gov.us
www.test.com
http://192.168.1.1/test.jpg
www.test.com:8080/test.jpg
www.website.gov.us/login.html
192.168.1.1/test.jpg
google.co.jp/maps
2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg
Explanations:
- \b is used as a word boundary to delimit the URL from the rest of the text.
- (?:https?://)? matches http:// or https:// if present.
- (?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6}) matches a standard URL, which may start with www. (let's call it STANDARD_URL).
- (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) matches a standard IPv4 address (let's call it IPv4).
- To match IPv6 URLs (let's call it IPv6):
(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9]))
- To match the port part (let's call it PORT), if present:
(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])
- To match the target object part of the URL (HTML file, JPG, ...) (let's call it RESSOURCE_PATH):
(?:/[\w\.-]*)*/?
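The individual pieces can be checked in isolation before they are combined. The sketch below copies the IPv4 and port sub-patterns from the full regex and tests them with re.fullmatch. The helper name matches_fully and the sample strings are illustrative assumptions, not part of the original answer.

```python
import re

# Sub-patterns copied from the full regex.
IPV4 = (r"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
        r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)")
PORT = (r"(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}"
        r"|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])")


def matches_fully(pattern, text):
    """Hypothetical helper: True if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Each octet is limited to 0-255, so 256 is rejected.
ipv4_ok = matches_fully(IPV4, "192.168.1.1")
ipv4_bad = matches_fully(IPV4, "256.1.1.1")

# Port checks: with a colon and four digits, with a colon and five digits,
# and five digits without a colon.
port_4 = matches_fully(PORT, ":8080")
port_5_colon = matches_fully(PORT, ":65535")
port_5_bare = matches_fully(PORT, "65535")

print(ipv4_ok, ipv4_bad, port_4, port_5_colon, port_5_bare)
```

Under these assumptions, the last two checks suggest that the leading colon belongs only to the first alternative of the port group, so a five-digit port written as :65535 is not matched by this sub-pattern on its own.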
This gives the following regex:
\b((?:https?://)?(?:STANDARD_URL|IPv4|IPv6)(?:PORT)?(?:RESSOURCE_PATH))\b
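The template above can be sketched in Python by building the pattern from named pieces. This is a minimal sketch, not the author's code: the variable names, the sample text, and the shortened IPv6 piece (full eight-group form only) are assumptions made to keep the example readable.

```python
import re

# Named pieces, copied from the full regex.
STANDARD_URL = r"(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})"
IPV4 = (r"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
        r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)")
# Assumption: only the full eight-group IPv6 form is kept here for brevity;
# the full regex also lists the compressed (::) and IPv4-mapped forms.
IPV6_FULL_ONLY = r"(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}"
PORT = (r"(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}"
        r"|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])")
RESSOURCE_PATH = r"(?:/[\w\.-]*)*/?"

# Assemble the pieces following the template:
# \b( scheme? (host alternatives) port? path )\b
URL_RE = re.compile(
    r"\b((?:https?://)?"
    + "(?:" + "|".join([STANDARD_URL, IPV4, IPV6_FULL_ONLY]) + ")"
    + "(?:" + PORT + ")?"
    + "(?:" + RESSOURCE_PATH + ")"
    + r")\b"
)

sample = (
    "Hey bob on www.test.com. See http://192.168.1.1/test.jpg. "
    "Port test: www.test.com:8080/test.jpg and "
    "2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg."
)

# The pattern has a single capturing group, so findall returns the URLs.
found = URL_RE.findall(sample)
print(found)
```

Keeping each piece in its own variable makes it easier to test or replace one part (for example the IPv6 alternatives) without retyping the whole expression.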
Sources:
IPv6: Regular expression that matches valid IPv6 addresses
IPv4: http://www.safaribooksonline.com/library/view/regular-expressions-cookbook/9780596802837/CH07S16.html
Port: http://stackoverflow.com/a/12968117/8794221
Other sources: http://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149
$ more url.py
import re

# Sample text containing URLs in several formats
inputString = """add1 http://mit.edu.com abc
add2 http://facebook.jp.com.2. abc
add3 www.google.be. uvw
add4 http://www.google.be. 123
add5 www.website.gov.us test2
Hey bob on www.test.com.
another test with ipv4 http://192.168.1.1/test.jpg. toto2
website with different port number www.test.com:8080/test.jpg not port 80
www.website.gov.us/login.html
test with ipv4 (192.168.1.1/test.jpg).
search at google.co.jp/maps.
test with ipv6 2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg."""

# Raw string so the backslashes reach the regex engine unchanged
regex = r"\b((?:https?://)?(?:(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])))(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?(?:/[\w\.-]*)*/?)\b"

# The pattern has one capturing group, so findall returns the matched URLs
matches = re.findall(regex, inputString)
print(matches)
OUTPUT:
$ python url.py
['http://mit.edu.com', 'http://facebook.jp.com', 'www.google.be', 'http://www.google.be', 'www.website.gov.us', 'www.test.com', 'http://192.168.1.1/test.jpg', 'www.test.com:8080/test.jpg', 'www.website.gov.us/login.html', '192.168.1.1/test.jpg', 'google.co.jp/maps', '2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg']