Guide: get links from text in Python

You can use the following monstrous regex:

\b((?:https?://)?(?:(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])))(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?(?:/[\w\.-]*)*/?)\b

Demo regex101

This regex accepts URLs in the following formats:

INPUT:

add1 //mit.edu.com abc
add2 //facebook.jp.com.2. abc
add3 www.google.be. uvw
add4 //www.google.be. 123
add5 www.website.gov.us test2
Hey bob on www.test.com. 
another test with ipv4 //192.168.1.1/test.jpg. toto2
website with different port number www.test.com:8080/test.jpg not port 80
www.website.gov.us/login.html
test with ipv4 192.168.1.1/test.jpg.
search at google.co.jp/maps.
test with ipv6 2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg.

OUTPUT:

//mit.edu.com
//facebook.jp.com
www.google.be
//www.google.be
www.website.gov.us
www.test.com
//192.168.1.1/test.jpg
www.test.com:8080/test.jpg
www.website.gov.us/login.html
192.168.1.1/test.jpg
google.co.jp/maps
2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg

Explanations:

  • \b is used as a word boundary to delimit the URL from the rest of the text
  • (?:https?://)? matches http:// or https:// if present
  • (?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6}) matches a standard URL, which may start with www. (let's call it STANDARD_URL)
  • (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) matches a standard IPv4 address (let's call it IPv4)
  • To match IPv6 URLs:
    (?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9]))
    (let's call it IPv6)
  • To match the port part (let's call it PORT) if present:
    (?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?
  • To match the target object part of the URL (HTML file, JPG, ...) (let's call it RESSOURCE_PATH):
    (?:/[\w\.-]*)*/?
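
As a quick sanity check, individual fragments from the list above can be tried in isolation before they are combined. The sketch below is only an illustration: the constant names IPV4 and RESSOURCE_PATH mirror the labels used in the explanation, and the helper fully_matches is a hypothetical name that is not part of the original answer.

```python
import re

# Fragments copied from the explanation; the constant and function names
# are illustrative and not part of the original answer.
IPV4 = (
    r"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
)
RESSOURCE_PATH = r"(?:/[\w\.-]*)*/?"


def fully_matches(fragment, text):
    """Return True when the whole of `text` matches `fragment`."""
    return re.fullmatch(fragment, text) is not None


print(fully_matches(IPV4, "192.168.1.1"))          # True
print(fully_matches(IPV4, "256.1.1.1"))            # False: 256 is above 255
print(fully_matches(RESSOURCE_PATH, "/test.jpg"))  # True
print(fully_matches(RESSOURCE_PATH, "/a b"))       # False: a space is not allowed
```

re.fullmatch is used here instead of re.findall so that a fragment only passes when it consumes the entire string, which makes the octet range limits easy to see.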

This gives the following regex:

\b((?:https?://)?(?:STANDARD_URL|IPv4|IPv6)(?:PORT)?(?:RESSOURCE_PATH))\b
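
The template can be assembled in Python from named fragments, which is easier to read and maintain than the single long literal. This is a minimal sketch under stated assumptions: the IPv6 alternative is left out purely to keep the example short, the sample text is made up for illustration, and URL_PATTERN and extract_links are hypothetical names.

```python
import re

# Named fragments following the template; IPv6 is omitted here only to
# keep the sketch short, so IPv6 URLs will not be found by this version.
STANDARD_URL = r"(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})"
IPV4 = (
    r"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
)
PORT = r":[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5]"
RESSOURCE_PATH = r"(?:/[\w\.-]*)*/?"

# Same shape as the template: scheme, host alternatives, port, path.
URL_PATTERN = re.compile(
    rf"\b((?:https?://)?(?:{STANDARD_URL}|{IPV4})(?:{PORT})?(?:{RESSOURCE_PATH}))\b"
)


def extract_links(text):
    """Return every substring of `text` that matches URL_PATTERN."""
    return URL_PATTERN.findall(text)


# Made-up sample text, not taken from the original answer.
sample = "Visit www.test.com. or http://example.com/index.html, then 192.168.1.1:8080/test.jpg today"
print(extract_links(sample))
# ['www.test.com', 'http://example.com/index.html', '192.168.1.1:8080/test.jpg']
```

Because the pattern has exactly one capturing group, findall returns plain strings rather than tuples. The trailing \b is what keeps the sentence-ending period and the comma out of the results.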

Sources:

IPv6: Regular expression that matches valid IPv6 addresses

IPv4: //www.safaribooksonline.com/library/view/regular-expressions-cookbook/9780596802837/CH07S16.html

Port: //stackoverflow.com/a/12968117/8794221

Other sources: //code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149

$ more url.py

import re

inputString = """add1 //mit.edu.com abc
add2 //facebook.jp.com.2. abc
add3 www.google.be. uvw
add4 //www.google.be. 123
add5 www.website.gov.us test2
Hey bob on www.test.com. 
another test with ipv4 //192.168.1.1/test.jpg. toto2
website with different port number www.test.com:8080/test.jpg not port 80
www.website.gov.us/login.html
test with ipv4 (192.168.1.1/test.jpg).
search at google.co.jp/maps.
test with ipv6 2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg."""

# Raw string so the backslashes reach the regex engine unchanged.
regex = r"\b((?:https?://)?(?:(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])))(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?(?:/[\w\.-]*)*/?)\b"

# findall returns the text captured by the single outer group for each match.
matches = re.findall(regex, inputString)
print(matches)

OUTPUT:

$ python url.py 
['//mit.edu.com', '//facebook.jp.com', 'www.google.be', '//www.google.be', 'www.website.gov.us', 'www.test.com', '//192.168.1.1/test.jpg', 'www.test.com:8080/test.jpg', 'www.website.gov.us/login.html', '192.168.1.1/test.jpg', 'google.co.jp/maps', '2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg']
