Regular Expression

Character types
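  • \w matches any single alphanumeric character or underscore
  • . matches any single character, including symbols, except a newline
  • \d matches any single digit [0-9]
  • \s matches any single whitespace character: space, tab, or newline
  • \. matches a literal dot (period)
  • [a-z] any lowercase letter a-z
  • [A-Z] any uppercase letter A-Z
  • [^a-z] any character that is not a lowercase letter a-z
  • [0-9] any digit 0-9
  • [^0-9] negated class: any character that is not a digit
  • ^ start of the string (or of each line with re.MULTILINE)
  • $ end of the string (or of each line with re.MULTILINE)
  • | alternation: matches the expression on its left or the one on its right
  • p?each matches "each" optionally preceded by one "p", so both each and peach match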
import re
re.findall(r"\w", "h32rb17")   # ['h', '3', '2', 'r', 'b', '1', '7']

import re
re.findall(r"\d", "h32rb17")   # ['3', '2', '1', '7']
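The anchors, alternation, negated classes, and the optional character from the list above can be tried the same way. A minimal sketch; the sample strings are made up for illustration:

```python
import re

sample = "42 apples and 7 pears"   # made-up sample text

# ^ and $ anchor the pattern to the start and end of the string
starts = re.findall(r"^\d+", sample)                 # ['42']
ends = re.findall(r"\w+$", sample)                   # ['pears']

# | matches either alternative
either = re.findall(r"cat|dog", "cat, dog, bird")    # ['cat', 'dog']

# [^0-9] matches any single character that is not a digit
non_digits = re.findall(r"[^0-9]", "h32rb17")        # ['h', 'r', 'b']

# p?each matches "each" with or without a leading "p"
optional_p = re.findall(r"p?each", "each peach")     # ['each', 'peach']

print(starts, ends, either, non_digits, optional_p)
```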
Quantify occurrences

Repetition symbols (quantifiers)

  • + : one or more occurrences of the preceding character. Same as {1,}
  • * : zero or more occurrences of the preceding character. Same as {0,}
  • ? : zero or one occurrence
  • {n} : exactly n occurrences
  • {n,} : n or more occurrences
  • {0,n} : between 0 and n occurrences
  • {n,m} : between n and m occurrences
  • \d{2} matches exactly two consecutive digits
  • \d{1,3} matches a run of one to three digits
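To see how the brace quantifiers and ? behave, here is a small sketch on made-up strings. Note that findall consumes the text left to right, so a long run of digits is split into consecutive chunks:

```python
import re

digits = "1 22 333 4444"   # made-up sample text

exactly_two = re.findall(r"\d{2}", digits)       # ['22', '33', '44', '44']
three_or_more = re.findall(r"\d{3,}", digits)    # ['333', '4444']
one_to_three = re.findall(r"\d{1,3}", digits)    # ['1', '22', '333', '444', '4']

# ? makes the preceding character optional
optional_u = re.findall(r"colou?r", "color colour")   # ['color', 'colour']

print(exactly_two, three_or_more, one_to_three, optional_u)
```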
Functions

.findall(<regex>, <string>) 

  • Finds every non-overlapping match in the string
  • Returns a list
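import re
re.findall(r"\d+", "h32rb17")   # ['32', '17']

import re
# * also matches zero digits, so empty strings appear wherever no digit is found
re.findall(r"\d*", "h32rb17")   # ['', '32', '', '', '17', '']

import re
re.findall(r"\d{2}", "h32rb17 k825t0m c2994eh")   # ['32', '17', '82', '29', '94']

import re
re.findall(r"\d{1,3}", "h32rb17 k825t0m c2994eh")   # ['32', '17', '825', '0', '299', '4']
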
import re
pattern = r"\w+:\s\d+"
employee_logins_string = "1001 bmoreno: 12 Marketing 1002 tshah: 7 Human Resources 1003 sgilmore: 5 Finance"
print(re.findall(pattern, employee_logins_string))
# Output: ['bmoreno: 12', 'tshah: 7', 'sgilmore: 5']

.search(<regex>, <string>, re.IGNORECASE) 

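  • r"regex" : the r prefix marks a raw string. Python does not process backslash escapes in it, so the pattern reaches the function unchanged
  • re.IGNORECASE is an optional flag that makes the match case-insensitive
  • Returns only the first match
  • Returns a Match object, or None when nothing matches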
import re
log = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
regex = r"\[(\d+)\]"
result = re.search(regex, log)

print(result)     # Output: <re.Match object; span=(39, 46), match='[12345]'>
print(result[1])  # Output: 12345
import re
print(re.search(r"[Pp]ython", "Python"))

# Output: <re.Match object; span=(0, 6), match='Python'>
import re
print(re.search(r"Py.*n", "Pygmalion")) 
print(re.search(r"Py.*n", "Python Programming"))
print(re.search(r"Py[a-z]*n", "Python Programming"))
print(re.search(r"Py[a-z]*n", "Pyn"))

# Output:
# <re.Match object; span=(0, 9), match='Pygmalion'>
# <re.Match object; span=(0, 17), match='Python Programmin'>
# <re.Match object; span=(0, 6), match='Python'>
# <re.Match object; span=(0, 3), match='Pyn'>
import re
print(re.search(r"o+l+", "goldfish"))
print(re.search(r"o+l+", "woolly"))
print(re.search(r"o+l+", "boil"))

# Output:
# <re.Match object; span=(1, 3), match='ol'>
# <re.Match object; span=(1, 5), match='ooll'>
# None
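
The raw-string note and the re.IGNORECASE flag listed for .search are not exercised in the examples above. A minimal sketch of both, using made-up strings; \b (a word boundary in regex, but a backspace in an ordinary Python string) is chosen only because it makes the difference visible:

```python
import re

# Without the r prefix, Python turns "\b" into a backspace character
# before the regex engine ever sees it, so the pattern no longer fits.
plain = re.search("\bcat\b", "a cat")      # None
raw = re.search(r"\bcat\b", "a cat")       # matches 'cat'

# re.IGNORECASE makes the match case-insensitive
strict = re.search(r"python", "I like Python")                    # None
relaxed = re.search(r"python", "I like Python", re.IGNORECASE)    # matches 'Python'

print(plain, raw[0], strict, relaxed[0])   # None cat None Python
```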

Regex examples
  • r"\d{3}-\d{3}-\d{4}"   This line of code matches U.S. phone numbers in the format 111-222-3333.
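  • r"^-?\d*(\.\d+)?$"   Any positive or negative number, with or without decimal places.
  • r"^(.+)\/([^\/]+)\/"  File paths. Group 1 captures the leading directories and group 2 the last directory name before the final slash.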
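
A quick way to check what these three patterns accept is to run them against a few inputs. A sketch with made-up test strings; note that the number pattern also accepts an empty string, because every part of it is optional:

```python
import re

phone = r"\d{3}-\d{3}-\d{4}"
number = r"^-?\d*(\.\d+)?$"
path = r"^(.+)\/([^\/]+)\/"

phone_match = re.search(phone, "Call 111-222-3333 today")[0]   # '111-222-3333'

negative_decimal = bool(re.search(number, "-12.5"))   # True
trailing_dot = bool(re.search(number, "12."))         # False: a dot needs digits after it
empty_string = bool(re.search(number, ""))            # True: every part is optional

path_groups = re.search(path, "/usr/local/bin/python").groups()   # ('/usr/local', 'bin')

print(phone_match, negative_decimal, trailing_dot, empty_string, path_groups)
```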

IP addresses
import re

# Assign `log_file` to a string containing username, date, login time, and IP address for a series of login attempts 
log_file = "eraab 2022-05-10 6:03:41 192.168.152.148 \niuduike 2022-05-09 6:46:40 192.168.22.115 \nsmartell 2022-05-09 19:30:32 192.168.190.178 \narutley 2022-05-12 17:00:59 1923.1689.3.24 \nrjensen 2022-05-11 0:59:26 192.168.213.128 \naestrada 2022-05-09 19:28:12 1924.1680.27.57 \nasundara 2022-05-11 18:38:07 192.168.96.200 \ndkot 2022-05-12 10:52:00 1921.168.1283.75 \nabernard 2022-05-12 23:38:46 19245.168.2345.49 \ncjackson 2022-05-12 19:36:42 192.168.247.153 \njclark 2022-05-10 10:48:02 192.168.174.117 \nalevitsk 2022-05-08 12:09:10 192.16874.1390.176 \njrafael 2022-05-10 22:40:01 192.168.148.115 \nyappiah 2022-05-12 10:37:22 192.168.103.10654 \ndaquino 2022-05-08 7:02:35 192.168.168.144"

# Assign `pattern` to a regular expression that matches well-formed IP addresses
# The \b word boundaries keep it from matching part of a longer digit run such as 192.168.103.10654
pattern = r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"

# Use `re.findall()` on `pattern` and `log_file` and assign `valid_ip_addresses` to the output 
valid_ip_addresses = re.findall(pattern, log_file)

# Assign `flagged_addresses` to a list of IP addresses that have been previously flagged for unusual activity
flagged_addresses = ["192.168.190.178", "192.168.96.200", "192.168.174.117", "192.168.168.144"]

# Iterative statement begins here
# Loop through `valid_ip_addresses` with `address` as the loop variable
for address in valid_ip_addresses:

    # Conditional begins here
    # If `address` belongs to `flagged_addresses`, display "The IP address ______ has been flagged for further analysis."
    if address in flagged_addresses:
        print("The IP address", address, "has been flagged for further analysis.")

    # Otherwise, display "The IP address ______ does not require further analysis."
    else:
        print("The IP address", address, "does not require further analysis.")
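
The \d{1,3} pattern checks only the shape of an address, so a value such as 999 would still pass. One possible way to add a range check is sketched below; the helper name `is_valid_ipv4` and the sample 192.168.300.1 are made up for illustration and are not part of the exercise above:

```python
import re

def is_valid_ipv4(address):
    """Return True if address is four dot-separated numbers, each 0-255."""
    # fullmatch requires the whole string to fit the pattern
    match = re.fullmatch(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})", address)
    if match is None:
        return False
    return all(0 <= int(octet) <= 255 for octet in match.groups())

# The first three samples come from the log above; the last one is hypothetical
samples = ["192.168.152.148", "1923.1689.3.24", "192.168.103.10654", "192.168.300.1"]
for sample in samples:
    print(sample, is_valid_ipv4(sample))
```

Only the first sample is reported as valid; the others fail either the shape check or the 0-255 range check.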
String-checking functions

Each function returns True or False

import re
def check_aei(text):
  # An "a", then an "e", then an "i", with at least one character between each pair
  result = re.search(r".*a.+e.+i.*", text)
  return result is not None

print(check_aei("academia")) # True
print(check_aei("aerial")) # False
print(check_aei("paramedic")) # True

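Function: check whether a string contains any punctuation

import re
def check_punctuation(text):
  # Anything other than a letter or a space counts as punctuation here (digits included)
  result = re.search(r"[^a-zA-Z ]", text)
  return result is not None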

print(check_punctuation("This is a sentence that ends with a period.")) # True
print(check_punctuation("This is a sentence fragment without a period")) # False
print(check_punctuation("Aren't regular expressions awesome?")) # True
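
Because [^a-zA-Z ] flags every character that is not a letter or a space, a string containing digits is also reported as having punctuation. If that is not wanted, one possible variant builds the character class from `string.punctuation` instead. This is only a sketch; the name `check_punctuation_strict` is hypothetical:

```python
import re
import string

def check_punctuation_strict(text):
    """Return True only if text contains an ASCII punctuation character."""
    # re.escape protects characters such as ], ^, - and \ inside the class
    pattern = "[" + re.escape(string.punctuation) + "]"
    return re.search(pattern, text) is not None

print(check_punctuation_strict("Room 101"))                              # False
print(check_punctuation_strict("Aren't regular expressions awesome?"))   # True
```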

Function: check web address

import re
def check_web_address(text):
  # Letters, digits, underscores or hyphens, then a dot, then letters only up to the end
  pattern = r"[\w-]*\.[a-zA-Z]*$"
  result = re.search(pattern, text)
  return result is not None

print(check_web_address("gmail.com")) # True
print(check_web_address("www@google")) # False
print(check_web_address("www.Coursera.org")) # True
print(check_web_address("web-address.com/homepage")) # False
print(check_web_address("My_Favorite-Blog.US")) # True

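Function: check a 12-hour time

import re
def check_time(text):
  # Hour 1-12, minutes 00-59, optional spaces, then AM or PM in any letter case
  pattern = r"^(1[0-2]|[1-9]):[0-5][0-9] *[AaPp][Mm]$"
  result = re.search(pattern, text)
  return result is not None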

print(check_time("12:45pm")) # True
print(check_time("9:59 AM")) # True
print(check_time("6:60am")) # False
print(check_time("five o'clock")) # False
print(check_time("6:02 am")) # True
print(check_time("6:02km")) # False

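Function: check for parenthesized text that starts with an uppercase letter or a digit

import re
def contains_acronym(text):
  # An opening parenthesis, one uppercase letter or digit, any further letters, a closing parenthesis
  pattern = r"\([0-9A-Z][a-zA-Z]*\)"
  result = re.search(pattern, text)
  return result is not None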

print(contains_acronym("Instant messaging (IM) is a set of communication technologies used for text-based communication")) # True
print(contains_acronym("American Standard Code for Information Interchange (ASCII) is a character encoding standard for electronic communication")) # True
print(contains_acronym("Please do NOT enter without permission!")) # False
print(contains_acronym("PostScript is a fourth-generation programming language (4GL)")) # True
print(contains_acronym("Have fun using a self-contained underwater breathing apparatus (Scuba)!")) # True
Capturing Groups
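  • Purpose: extract the separate pieces of a string that match parts of a regex
  • Wrap part of a regex in parentheses to define it as a group
  • With several pairs of parentheses, groups are numbered in order: group 1, group 2, ...
  • .groups() method : returns a tuple, e.g. (group1, group2, group3)
  • result[0]: the entire match, result[1]: group 1, result[2]: group 2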
import re
result = re.search(r"^(\w*), (\w*)$", "Lovelace, Ada")
print(result)
print(result.groups())
print(result[0])
print(result[1])
print(result[2])
print("{} {}".format(result[2], result[1]))

# Output
# <re.Match object; span=(0, 13), match='Lovelace, Ada'>
# ('Lovelace', 'Ada')
# Lovelace, Ada
# Lovelace
# Ada
# Ada Lovelace
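
Since re.search returns None when nothing matches, indexing the result directly would raise an error on unexpected input. The group-swapping step above could be wrapped in a function with a guard, for example as in this sketch (the function name `rearrange_name` is an illustrative choice):

```python
import re

def rearrange_name(name):
    """Turn 'Last, First' into 'First Last'; return the input unchanged otherwise."""
    result = re.search(r"^(\w*), (\w*)$", name)
    if result is None:      # re.search returns None when nothing matches
        return name
    return "{} {}".format(result[2], result[1])

print(rearrange_name("Lovelace, Ada"))   # Ada Lovelace
print(rearrange_name("Ada Lovelace"))    # Ada Lovelace (no comma, so returned unchanged)
```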

Resources