Guide: get links from text in Python

You can use the following monstrous regex:

\b((?:https?://)?(?:(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])))(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?(?:/[\w\.-]*)*/?)\b

Demo regex101

This regex accepts URLs in the following formats:

INPUT:

add1 http://mit.edu.com abc
add2 https://facebook.jp.com.2. abc
add3 www.google.be. uvw
add4 https://www.google.be. 123
add5 www.website.gov.us test2
Hey bob on www.test.com. 
another test with ipv4 http://192.168.1.1/test.jpg. toto2
website with different port number www.test.com:8080/test.jpg not port 80
www.website.gov.us/login.html
test with ipv4 192.168.1.1/test.jpg.
search at google.co.jp/maps.
test with ipv6 2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg.

OUTPUT:

http://mit.edu.com
https://facebook.jp.com
www.google.be
https://www.google.be
www.website.gov.us
www.test.com
http://192.168.1.1/test.jpg
www.test.com:8080/test.jpg
www.website.gov.us/login.html
192.168.1.1/test.jpg
google.co.jp/maps
2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg

Explanations:

  • \b marks a word boundary, which separates the URL from the rest of the text
  • (?:https?://)? matches http:// or https:// if present
  • (?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6}) matches a standard URL, which may start with www. (let's call it STANDARD_URL)
  • (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) matches a standard IPv4 address (let's call it IPv4)
  • To match IPv6 URLs: (?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])) (let's call it IPv6)
  • To match the port part (let's call it PORT) if present: (?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?
  • To match the target object part of the URL (HTML file, JPG, ...) (let's call it RESSOURCE_PATH): (?:/[\w\.-]*)*/?
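
To see what the word boundary and the optional scheme do on their own, here is a minimal sketch that uses only the first three pieces of the list above: \b, the optional http(s):// prefix and the standard-URL part. The name simple_url is made up for this illustration, and the pattern is deliberately incomplete (no IP addresses, ports or paths).

```python
import re

# Illustration only: \b + optional scheme + standard-URL piece.
# IPv4, IPv6, port and path handling are left out on purpose.
simple_url = re.compile(
    r"\b((?:https?://)?(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6}))\b"
)

text = "Hey bob on www.test.com. add4 https://www.google.be. 123"

# findall returns the content of the single capturing group for each match.
found = simple_url.findall(text)
print(found)  # ['www.test.com', 'https://www.google.be']
```

The sentence-ending dots are left out because the match has to end in a top-level domain followed by a word boundary.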

This gives the following regex:

\b((?:https?://)?(?:STANDARD_URL|IPv4|IPv6)(?:PORT)?(?:RESSOURCE_PATH))\b
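
One possible way to keep the pattern readable is to store each named piece in its own variable and assemble the template in code. The sketch below does that; the variable names (STANDARD_URL, IPV4, IPV6_ALTERNATIVES, IPV6, PORT, RESSOURCE_PATH, URL_REGEX) and the sample text are assumptions made for this illustration, while the sub-patterns are copied from the full regex.

```python
import re

# Building blocks, copied piece by piece from the full regex.
STANDARD_URL = r"(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})"
IPV4 = (r"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
        r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)")
# The IPv6 piece is one big alternation; one alternative per list entry.
IPV6_ALTERNATIVES = [
    r"(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}",
    r"(?:[0-9a-fA-F]{1,4}:){1,7}:",
    r"(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}",
    r"(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}",
    r"(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}",
    r"(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}",
    r"(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}",
    r"[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})",
    r":(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)",
    r"fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}",
    r"::(?:ffff(?::0{1,4}){0,1}:){0,1}"
    r"(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}"
    r"(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])",
    r"(?:[0-9a-fA-F]{1,4}:){1,4}:"
    r"(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}"
    r"(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])",
]
IPV6 = "(?:" + "|".join(IPV6_ALTERNATIVES) + ")"
# Body of the optional port group; the template wraps it in (?:...)?.
PORT = r":[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5]"
RESSOURCE_PATH = r"(?:/[\w\.-]*)*/?"

# Assemble the template: \b((scheme)?(host)(port)?(path))\b
URL_REGEX = re.compile(
    r"\b((?:https?://)?(?:" + "|".join([STANDARD_URL, IPV4, IPV6]) + ")"
    + "(?:" + PORT + ")?"
    + "(?:" + RESSOURCE_PATH + r"))\b"
)

sample = ("www.test.com:8080/test.jpg not port 80, "
          "ipv4 http://192.168.1.1/test.jpg. "
          "and ipv6 2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg.")
matches = URL_REGEX.findall(sample)
print(matches)
```

Splitting the IPv6 alternation into a list is only a readability choice; joining the entries with | reproduces the same alternation as in the single-line regex.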

Sources:

IPv6: Regular expression that matches valid IPv6 addresses

IPv4: https://www.safaribooksonline.com/library/view/regular-expressions-cookbook/9780596802837/ch07s16.html

Port: https://stackoverflow.com/a/12968117/8794221

Other sources: https://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149


$ more url.py

import re

inputString = """add1 http://mit.edu.com abc
add2 https://facebook.jp.com.2. abc
add3 www.google.be. uvw
add4 https://www.google.be. 123
add5 www.website.gov.us test2
Hey bob on www.test.com. 
another test with ipv4 http://192.168.1.1/test.jpg. toto2
website with different port number www.test.com:8080/test.jpg not port 80
www.website.gov.us/login.html
test with ipv4 (192.168.1.1/test.jpg).
search at google.co.jp/maps.
test with ipv6 2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg."""

# Plain raw string: the ur"" prefix only exists in Python 2 and is a syntax error in Python 3.
regex = r"\b((?:https?://)?(?:(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])))(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?(?:/[\w\.-]*)*/?)\b"

matches = re.findall(regex, inputString)
print(matches)

OUTPUT:

$ python url.py 
['http://mit.edu.com', 'https://facebook.jp.com', 'www.google.be', 'https://www.google.be', 'www.website.gov.us', 'www.test.com', 'http://192.168.1.1/test.jpg', 'www.test.com:8080/test.jpg', 'www.website.gov.us/login.html', '192.168.1.1/test.jpg', 'google.co.jp/maps', '2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg']
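
re.findall only returns the matched strings. If the position of each link in the text is also needed, re.finditer can be used instead. The sketch below is one way to do it; the helper name find_links_with_offsets and the short demo_pattern are assumptions for this illustration (the demo pattern is a cut-down stand-in, and the full regex from url.py would be passed in real use).

```python
import re

def find_links_with_offsets(pattern, text):
    """Return (start, end, url) tuples for group 1 of every match."""
    return [(m.start(1), m.end(1), m.group(1)) for m in pattern.finditer(text)]

# Stand-in pattern for the demo only: scheme + standard-URL piece.
demo_pattern = re.compile(
    r"\b((?:https?://)?(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6}))\b"
)

hits = find_links_with_offsets(demo_pattern, "Hey bob on www.test.com.")
print(hits)  # [(11, 23, 'www.test.com')]
```

The offsets follow Python slice conventions, so text[start:end] gives back the matched link.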