Common regexes

(.*?) (find everything; lazy, so it stops at the first place the rest of the pattern can match)
[A-Z]{2,} (find ALLCAPs 2 or longer)
(?s) multiline: lets . match line breaks too (not BBEdit only; a standard inline flag in PCRE-style engines)
[\s\S]* multiline for Sigil (might need tweaking for brackets)
[\xC0-\xFF]+ special characters (accents, umlauts etc.). Useful for finding foreign-language words
[\x{00C0}-\x{017E}]+ searches an extended Unicode range (Latin-1 Supplement plus Latin Extended-A)
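
A minimal sketch of the patterns above, assuming Python's built-in re module. The sample strings are made up, and Python spells \x{00C0} as \u00C0:

```python
import re

# Sample text: \u00eb is e-diaeresis, \u0160 is S-caron.
text = "NASA sent Zo\u00eb and \u0160imek to the UN."

# Runs of two or more capital letters.
all_caps = re.findall(r"[A-Z]{2,}", text)

# Accented characters in the Latin-1 range only.
accented = re.findall(r"[\xC0-\xFF]+", text)

# The extended range also catches Latin Extended-A letters.
extended = re.findall(r"[\u00C0-\u017E]+", text)

# [\s\S]* crosses line breaks; a plain .* does not.
block = "<p>first\nsecond</p>"
dot_match = re.search(r"<p>(.*)</p>", block)
any_match = re.search(r"<p>([\s\S]*)</p>", block)

print(all_caps)   # ['NASA', 'UN']
print(dot_match)  # None
```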

Lookaheads and -behinds

Find between:
(?<=Bob is)(.*?)(?= a tool) - will find " not" (with its leading space) in "Bob is not a tool."

Find everything after > but before </h2>
([^>]+)(?=<\/h2)

Find the p in <p> or </p>
(?<=\<)p|p(?=\>)
- it finds p following a < or a p followed by a >
- will find both p's in <p class="toc-2">The Real Cost of Growing Food</p>

Find pattern not beginning with XXX
(?<!XXX)<span .*?>(.*?)</span> - it will ignore <span> preceded by XXX
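
The lookaround patterns above can be tried out in Python roughly like this. It is only a sketch; the sample strings and variable names are invented for illustration:

```python
import re

# Find between: the lookbehind and lookahead frame the match
# but are not part of it.
between = re.search(r"(?<=Bob is)(.*?)(?= a tool)", "Bob is not a tool.")

# Text sitting directly before a closing </h2>.
heading = re.search(r"([^>]+)(?=</h2)", "<h2 class='x'>Chapter One</h2>")

# Negative lookbehind: skip a <span> that directly follows XXX.
html = 'XXX<span class="a">skip</span> <span class="b">keep</span>'
spans = re.findall(r"(?<!XXX)<span .*?>(.*?)</span>", html)

print(repr(between.group(0)))  # ' not'
print(heading.group(1))        # Chapter One
print(spans)                   # ['keep']
```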

Not, Or...

[^a-z]  (caret = not)
(ABC|DEF) (| = or; inside [ ] a | is just a literal character)
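
A quick sketch of the difference between a character class and real alternation, assuming Python's re module and a made-up sample string:

```python
import re

text = "ABC DEF A|B"

# Inside square brackets, | is a literal character, not "or".
char_class = re.findall(r"[ABC|DEF]+", text)

# A group with | gives alternation between whole strings.
alternation = re.findall(r"(?:ABC|DEF)", text)

# A leading caret negates the class.
not_lower = re.findall(r"[^a-z]+", "abc123def")

print(char_class)   # ['ABC', 'DEF', 'A|B']
print(alternation)  # ['ABC', 'DEF']
print(not_lower)    # ['123']
```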

Optional Characters

? (trailing ? = previous character is optional)
- https? finds http or https
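
As a small check of the optional character, assuming Python's re module:

```python
import re

# The s is optional, so both spellings match.
found = re.findall(r"https?", "http and https")

print(found)  # ['http', 'https']
```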

Everything but

]>BREAK

- [^>]: any characters that are not a > (to allow for other classes and/or spaces)
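
One way [^>] might be used inside a tag, sketched in Python. The pattern <p[^>]*>BREAK and the sample HTML are assumptions made for illustration, not the pattern from the note above:

```python
import re

# Hypothetical sample: two paragraphs that hold BREAK, one that does not.
html = '<p>BREAK</p> <p class="pb">BREAK</p> <p  id="x">TEXT</p>'

# [^>]* skips over any attributes or spaces up to the closing > of the tag.
pattern = re.compile(r"<p[^>]*>BREAK")
found = pattern.findall(html)

print(found)  # ['<p>BREAK', '<p class="pb">BREAK']
```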

Non-capturing group

(?:xxx) (?: = don't capture)
- (?:<p>)Mr\. (.*?)(?:</p>) will only capture Jones in <p>Mr. Jones</p>

(?:xxx)? = add a trailing ? if the non-capturing group occurs zero or one times - eg: <p(?: class=".*?")?> has an optional class after p, so it matches both <p> and <p class="bob">, as in <p class="bob">PAGEBREAK</p>
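
A short sketch of both non-capturing forms, assuming Python's re module. Appending PAGEBREAK</p> to the second pattern is an assumption based on the sample text:

```python
import re

# Only the surname is captured; the <p> and </p> groups are not.
m = re.search(r"(?:<p>)Mr\. (.*?)(?:</p>)", "<p>Mr. Jones</p>")

# The optional non-capturing group lets the class attribute be absent.
optional_class = re.compile(r'<p(?: class=".*?")?>PAGEBREAK</p>')
plain = optional_class.search("<p>PAGEBREAK</p>")
classed = optional_class.search('<p class="bob">PAGEBREAK</p>')

print(m.groups())                            # ('Jones',)
print(plain is not None, classed is not None)  # True True
```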

Basic Commands

.  (Any Character Except New Line)
\d (Digit (0-9))
\D (Not a Digit (0-9))
\w (Word Character (a-z, A-Z, 0-9, _))
\W (Not a Word Character)
\s (Whitespace (space, tab, newline))
\S (Not Whitespace (space, tab, newline))
\b (Word Boundary front or back)
\B (Not a Word Boundary)
^  (Beginning of a String)
$  (End of a String)
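
A few of these in action, as a rough Python sketch with an invented sample string:

```python
import re

text = "Call 555-0199 before 9am, cat concat"

digits = re.findall(r"\d+", text)      # runs of digits
words = re.findall(r"\bcat\b", text)   # cat as a whole word only
starts = re.findall(r"^\w+", text)     # first word of the string
ends = re.findall(r"\w+$", text)       # last word of the string

print(digits)  # ['555', '0199', '9']
print(words)   # ['cat']
print(starts)  # ['Call']
print(ends)    # ['concat']
```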

Character sets

[] (Matches Characters in brackets)
[^ ] (Matches Characters NOT in brackets)
|  (Either Or)
( ) (Group)

Unicode

\p matches by Unicode property (PCRE-style engines; Python's built-in re module does not support it)
- \p{P} finds all punctuation, including dashes etc.
- \p{Lu} finds all uppercase letters
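
Python's built-in re module has no \p, so the sketch below swaps in the standard unicodedata module to pick out the same categories. This is a stand-in technique rather than a regex, and the sample text is made up:

```python
import unicodedata

# \u2014 is an em dash, category Pd (dash punctuation).
text = "Hello, World \u2014 OK!"

# Categories starting with P are punctuation, like \p{P}.
punctuation = [ch for ch in text if unicodedata.category(ch).startswith("P")]

# Category Lu is uppercase letter, like \p{Lu}.
uppercase = [ch for ch in text if unicodedata.category(ch) == "Lu"]

print(uppercase)  # ['H', 'W', 'O', 'K']
```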

Quantifiers:

*  (0 or More)
+  (1 or More)
?  (0 or One)
{3} (Exact Number)
{3,4} (Range of Numbers (Minimum, Maximum))
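
The quantifiers can be compared side by side like this. It is a sketch with invented sample strings, assuming Python's re module:

```python
import re

sample = "a ab abb"
zero_or_more = re.findall(r"ab*", sample)  # b may be absent or repeated
one_or_more = re.findall(r"ab+", sample)   # at least one b
zero_or_one = re.findall(r"ab?", sample)   # at most one b

runs = "a aa aaa aaaa aaaaa"
exactly_three = re.findall(r"\ba{3}\b", runs)
three_or_four = re.findall(r"\ba{3,4}\b", runs)

print(zero_or_more)   # ['a', 'ab', 'abb']
print(one_or_more)    # ['ab', 'abb']
print(zero_or_one)    # ['a', 'ab', 'ab']
print(exactly_three)  # ['aaa']
print(three_or_four)  # ['aaa', 'aaaa']
```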

Python

re.compile method - finditer

  • r makes the pattern a raw string, so backslashes reach the regex engine unchanged. To print the literal characters \t:
    print(r'\tTab')
  • re.compile lets Python store a pattern in a variable for reuse
    eg:
    import re
    pattern = re.compile(r'regexstring')
    matches = pattern.finditer(text_to_search)
    for match in matches:
        print(match)

returns: <re.Match object; span=(1, 4), match='abc'>
The span allows string slicing by index: 1 = start; 4 = end
- print(text_to_search[1:4])
- (See string slicing [https://youtu.be/ajrtAuDg3yw])

  • to select group (a particular set in parentheses) use:
    print(match.group(0))
    group 0 is entire match
    group 1 is first set of parentheses
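
Putting finditer, spans and groups together, as a minimal sketch. The sample text and the two-group pattern are made up for illustration:

```python
import re

text_to_search = "xabc yabc"
pattern = re.compile(r"(a)(bc)")

for match in pattern.finditer(text_to_search):
    start, end = match.span()
    # group(0) is the whole match; the slice gives the same text.
    print(match.group(0), match.group(1), text_to_search[start:end])

spans = [m.span() for m in pattern.finditer(text_to_search)]
first = pattern.search(text_to_search)

print(spans)  # [(1, 4), (6, 9)]
```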

Samples

urls = '''
http://bob.com
http://www.apple.ca
https://www.apple.com
https://appsforus.net
'''

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
- finds all urls and makes 3 groups: www, domain, top-level domain
- the dots are escaped (\.) so they match a literal dot rather than any character

subbed_urls = pattern.sub(r'\2\3', urls)
- replaces each matched url with groups 2 & 3 (domain, top-level domain)
- http://bob.com --> bob.com
- http://www.apple.ca --> apple.ca
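
A runnable version of the substitution, as a sketch. It assumes every sample url includes :// and that the dots in the pattern are escaped so they only match a literal dot:

```python
import re

urls = '''
http://bob.com
http://www.apple.ca
https://www.apple.com
https://appsforus.net
'''

# Groups: 1 = optional www., 2 = domain, 3 = top-level domain.
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

# Replace each matched url with groups 2 and 3 only.
subbed_urls = pattern.sub(r'\2\3', urls)

print(subbed_urls)
```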

re.compile method - findall method

matches = pattern.findall(text_to_search)
- with no groups, returns the matches as a list of strings
- with one group, returns a list of that group's strings; with multiple groups, a list of tuples
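
How the shape of the findall result changes with the number of groups, sketched with a made-up sample string:

```python
import re

text = "https://www.apple.com https://appsforus.net"

# No groups: a list of whole matches.
no_groups = re.findall(r"https?://\S+", text)

# One group: a list of that group's text.
one_group = re.findall(r"https?://(?:www\.)?(\w+)", text)

# Several groups: a list of tuples; an unmatched group becomes ''.
many_groups = re.findall(r"https?://(www\.)?(\w+)(\.\w+)", text)

print(no_groups)    # ['https://www.apple.com', 'https://appsforus.net']
print(one_group)    # ['apple', 'appsforus']
print(many_groups)  # [('www.', 'apple', '.com'), ('', 'appsforus', '.net')]
```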

re.compile method - match method

matches = pattern.match(text_to_search)
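- returns only the first match, and only if it starts at the beginning of the string; not iterable (no loops)
- returns a single match object like the ones finditer yields, or None if there is no match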

re.compile method - search method

matches = pattern.search(text_to_search)
- returns only the first match, anywhere in the string; not iterable (no loops)
- returns a single match object like the ones finditer yields, or None if there is no match
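
The difference between match and search shows up when the pattern does not sit at the start of the text. A small sketch with an invented sample string:

```python
import re

pattern = re.compile(r"start")
text_to_search = "a fresh start"

# match only looks at the beginning of the string.
at_beginning = pattern.match(text_to_search)

# search scans the whole string for the first hit.
anywhere = pattern.search(text_to_search)

print(at_beginning)     # None
print(anywhere.span())  # (8, 13)
```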

Flags

re.IGNORECASE
- pattern = re.compile(r'start', re.IGNORECASE)
- shorthand is re.I

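re.MULTILINE
- makes ^ and $ match at the start and end of every line, not just of the whole string
- shorthand is re.M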
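
Both flags in use, as a short sketch with a made-up two-line string:

```python
import re

text = "Start here\nstart again"

# IGNORECASE matches regardless of letter case.
ignore_case = re.findall(r"start", text, re.IGNORECASE)

# Without MULTILINE, ^ only matches at the start of the string.
no_flag = re.findall(r"^\w+", text)

# With MULTILINE, ^ also matches after each line break.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

print(ignore_case)  # ['Start', 'start']
print(no_flag)      # ['Start']
print(line_starts)  # ['Start', 'start']
```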