Common regexes
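(.*?) (lazy match: captures as few characters as possible)
[A-Z]{2,} (find ALL CAPS runs of 2 or more letters)
(?s) makes . match line breaks too (dot-all mode; works in BBEdit and in other PCRE-style engines, including Python)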
[\s\S]* multiline for Sigil (might need tweaking for brackets)
[\xC0-\xFF]+ special characters (accents, umlauts etc.). Useful for finding foreign-language words
[\x{00C0}-\x{017E}]+ searches an extended Unicode range (accented Latin letters up to U+017E)
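A quick way to try a few of these patterns in Python. This is only a sketch: the sample sentence and variable names are made up for illustration, and Python's re module is assumed rather than BBEdit or Sigil.

```python
import re

# Hypothetical sample text, not from the notes
text = "NASA and the EU met at a café in Zürich."

# [A-Z]{2,}: runs of two or more ASCII capitals
all_caps = re.findall(r"[A-Z]{2,}", text)
print(all_caps)

# [\xC0-\xFF]+: runs of Latin-1 accented characters
accented = re.findall(r"[\xC0-\xFF]+", text)
print(accented)

# (?s) lets . cross line breaks, so the lazy group can span two lines
block = re.search(r"(?s)<p>(.*?)</p>", "<p>line one\nline two</p>")
print(block.group(1))
```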
Lookaheads and -behinds
Find between:
(?<=Bob is)(.*?)(?= a tool)
- will find " not" (including the leading space) in "Bob is not a tool."
Find everything after > but before </h2>
([^>]+)(?=<\/h2)
Find the p in <p> or </p>
(?<=\<)p|p(?=\>)
- it finds p following a < or a p followed by a >
- will find both p's in <p class="toc-2">The Real Cost of Growing Food</p>
Find pattern not beginning with XXX
(?<!XXX)<span .*?>(.*?)</span>
- it will ignore <span> preceded by XXX
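The lookaround patterns above can be checked in Python roughly like this. The span sample string is hypothetical; the other strings come from the notes.

```python
import re

# Find between: only the text between the lookbehind and the lookahead is matched
between = re.search(r"(?<=Bob is)(.*?)(?= a tool)", "Bob is not a tool.")
print(repr(between.group(0)))  # the match keeps its leading space

# The p that follows a < or is followed by a >
html = '<p class="toc-2">The Real Cost of Growing Food</p>'
p_hits = [m.span() for m in re.finditer(r"(?<=<)p|p(?=>)", html)]
print(p_hits)

# Negative lookbehind: skip any <span ...> that comes right after XXX
spans = 'XXX<span class="a">skip</span> <span class="b">keep</span>'
kept = re.findall(r"(?<!XXX)<span .*?>(.*?)</span>", spans)
print(kept)
```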
Not, Or...
[^a-z] (caret = not)
(ABC|DEF) (| = or; inside [ ] a | is just a literal character)
Optional Characters
? (trailing ? = previous character is optional)
- https? finds http or https
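A small Python sketch of negation, alternation and the optional ?. The sample strings are invented for illustration.

```python
import re

# [^a-z]: every character that is NOT a lowercase ASCII letter
non_lower = re.findall(r"[^a-z]", "abc-DEF 42")
print(non_lower)

# ABC|DEF: either alternative (inside [ ] the | would be a literal character)
either = re.findall(r"ABC|DEF", "ABC xyz DEF")
print(either)

# https?: the trailing ? makes the s optional
schemes = re.findall(r"https?", "http://a.example and https://b.example")
print(schemes)
```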
Everything but
]>BREAK
- [^>]: any characters that are not a > (to allow for other classes and/or spaces)
Non-capturing group
(?:xxx) (?: = don't capture)
- (?:<p>)Mr\. (.*?)(?:</p>) will only capture Jones in <p>Mr. Jones</p>
(?:xxx)? = add a trailing ? when the non-capturing group may occur zero or one times
- eg: <p(?: class=".*?")?> makes the class optional after p. So it matches both <p> and the opening tag of <p class="bob">PAGEBREAK</p>
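The two non-capturing examples can be tried in Python as below. The variable names and the combined test string are illustrative.

```python
import re

# (?:...) groups without capturing, so only the name lands in group 1
m = re.search(r"(?:<p>)Mr\. (.*?)(?:</p>)", "<p>Mr. Jones</p>")
print(m.group(0))   # whole match, tags included
print(m.groups())   # only the capturing group

# Optional non-capturing group: the class attribute may or may not be there
tag = re.compile(r'<p(?: class=".*?")?>')
tags = tag.findall('<p>one</p><p class="bob">PAGEBREAK</p>')
print(tags)
```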
Basic Commands
. (Any Character Except New Line)
\d (Digit (0-9))
\D (Not a Digit (0-9))
\w (Word Character (a-z, A-Z, 0-9, _))
\W (Not a Word Character)
\s (Whitespace (space, tab, newline))
\S (Not Whitespace (space, tab, newline))
\b (Word Boundary front or back)
\B (Not a Word Boundary)
^ (Beginning of a String)
$ (End of a String)
Character sets
[] (Matches Characters in brackets)
[^ ] (Matches Characters NOT in brackets)
| (Either Or)
( ) (Group)
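A short Python sketch of the basic commands and character sets, using made-up sample strings.

```python
import re

text = "Order 66 shipped to cat_lover99"

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of word characters (_ counts)
print(digits)
print(words)

# \b only matches where a word starts or ends; _ is a word character
whole_cat = re.findall(r"\bcat\b", "cat concat cat_lover cat")
print(len(whole_cat))

# ^ and $ anchor to the beginning and end of the string
starts_with_order = bool(re.search(r"^Order", text))
ends_with_digit = bool(re.search(r"\d$", text))

# [] matches characters in the brackets, [^ ] matches everything else
vowels = re.findall(r"[aeiou]", "regex")
non_vowels = re.findall(r"[^aeiou]", "regex")
print(vowels, non_vowels)
```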
Unicode
\p searches by Unicode property
- \p{P} finds all punctuation, including dashes etc
- \p{Lu} finds all uppercase letters
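Note for Python: the standard re module does not support \p{...} (the third-party regex module does). As a rough stand-in, not the same technique, the same Unicode categories can be checked character by character with unicodedata. The sample sentence is invented.

```python
import unicodedata

# Hypothetical sample text
text = "Él dijo: «¡Hola, Zoë!» – 2024"

# Equivalent of \p{P}: any category starting with P (Po, Pd, Pi, Pf, ...)
punct = [ch for ch in text if unicodedata.category(ch).startswith("P")]

# Equivalent of \p{Lu}: uppercase letters
upper = [ch for ch in text if unicodedata.category(ch) == "Lu"]

print(punct)
print(upper)
```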
Quantifiers:
* (0 or More)
+ (1 or More)
? (0 or One)
{3} - Exact Number
{3,4} - Range of Numbers (Minimum, Maximum)
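To see how the quantifiers differ, this sketch runs each one against the same strings with re.fullmatch. The a...c strings are arbitrary test data.

```python
import re

samples = ["ac", "abc", "abbbc", "abbbbc", "abbbbbc"]
patterns = [r"ab*c", r"ab+c", r"ab?c", r"ab{3}c", r"ab{3,4}c"]

# For each pattern, list the samples it matches in full
results = {
    p: [s for s in samples if re.fullmatch(p, s)]
    for p in patterns
}

for p, hits in results.items():
    print(p, hits)
```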
Python
re.compile method - finditer
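- r marks a raw string: backslashes are kept as typed instead of being read as escapes
print(r'\tTab') prints the actual characters \tTab rather than a tab followed by Tab
pattern = re.compile() - lets Python store a pattern in a variable for reuse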
eg:
import re
pattern = re.compile(r'regexstring')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
returns: <re.Match object; span=(1, 4), match='abc'>
The span allows string slicing using index: 1 = start; 4 = end
- print(text_to_search[1:4])
- (See string slicing [https://youtu.be/ajrtAuDg3yw])
- to select group (a particular set in parentheses) use:
print(match.group(0))
group 0 is entire match
group 1 is first set of parentheses
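Putting finditer, span slicing and groups together in one runnable sketch. The search text and the two-group pattern are made up for illustration.

```python
import re

text_to_search = "xabc yabc"
pattern = re.compile(r"(a)(bc)")

found = []
for match in pattern.finditer(text_to_search):
    start, end = match.span()
    # Slicing with the span gives the same text as group(0)
    found.append((match.span(), text_to_search[start:end],
                  match.group(0), match.group(1), match.group(2)))

for item in found:
    print(item)
```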
Samples
urls = '''
http://bob.com
http://www.apple.ca
https://www.apple.com
https://appsforus.net
'''
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
- finds all urls and makes 3 groups: optional www., domain, top-level domain
subbed_urls = pattern.sub(r'\2\3', urls)
- replaces each whole match with groups 2 & 3 (domain, top-level domain)
- http://bob.com --> bob.com
- http://www.apple.ca --> apple.ca
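A runnable version of the sample. It assumes the dots in the pattern are escaped (www\. and \.\w+) and that the first URL is http://bob.com with its slashes, since the pattern requires ://.

```python
import re

urls = '''
http://bob.com
http://www.apple.ca
https://www.apple.com
https://appsforus.net
'''

# Group 1: optional "www.", group 2: domain, group 3: top-level domain
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

# Replace each whole match with just the domain and top-level domain
subbed_urls = pattern.sub(r'\2\3', urls)
print(subbed_urls)
```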
re.compile method - findall method
matches = pattern.findall(text_to_search)
- just returns matches as a list of strings
- with one group it returns that group's text; with multiple groups it returns a list of tuples
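How the shape of the findall result changes with the number of groups, as a sketch. The sample strings are invented; the URL pattern is the one from Samples with its dots escaped.

```python
import re

# No groups: list of whole matches
no_groups = re.findall(r"\d+", "a1 b22")

# One group: list of that group's text only
one_group = re.findall(r"(\w+)@", "bob@x amy@y")

# Several groups: list of tuples, with '' for a group that did not take part
url_pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
many_groups = url_pattern.findall("https://www.apple.com http://bob.com")

print(no_groups)
print(one_group)
print(many_groups)
```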
re.compile method - match method
matches = pattern.match(text_to_search)
- only returns a match found at the very beginning of the string (None otherwise); not iterable (no loops)
- a single match object with the same info as each finditer item
re.compile method - search method
matches = pattern.search(text_to_search)
- only returns the first match anywhere in the string (None if there is none); not iterable (no loops)
- a single match object with the same info as each finditer item
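The difference between match and search shows up when the pattern is not at the start of the string. A minimal sketch with a made-up string:

```python
import re

pattern = re.compile(r"abc")

# match only looks at the beginning of the string
at_start = pattern.match("xabc")
print(at_start)            # None: the string starts with x

# search scans forward and returns the first hit
anywhere = pattern.search("xabc")
print(anywhere.span())

# Both return a single match object (or None), so test before using it
first = pattern.match("abcx")
if first:
    print(first.group(0), first.span())
```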
Flags
re.IGNORECASE
- pattern = re.compile(r'start', re.IGNORECASE)
- shorthand is re.I
re.MULTILINE
- makes ^ and $ match at the start and end of every line, not just the whole string
- shorthand is re.M
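Both flags in action, as a sketch with invented sample text:

```python
import re

# re.IGNORECASE (re.I): letter case no longer matters
pattern = re.compile(r'start', re.IGNORECASE)
any_case = pattern.findall("Start START start")
print(any_case)

text = "first line\nsecond line"

# Without re.MULTILINE, ^ only matches at the very start of the string
default_mode = re.findall(r"^\w+", text)

# With re.MULTILINE (re.M), ^ also matches after each line break
multiline_mode = re.findall(r"^\w+", text, re.M)

print(default_mode)
print(multiline_mode)
```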