Common regexes

(.*?) (find everything; the ? makes it lazy, so it stops at the first place the rest of the pattern can match)
[A-Z]{2,} (find ALLCAPs 2 or longer)
(?s) lets . match newlines, so .*? can span lines (works in BBEdit and other PCRE-style engines, and in Python)
[\s\S]*? multiline for Sigil (might need tweaking for brackets)
e.g.

<aside aria-labelledby="_idHeading(\d+)" class="dark">([\s\S]*?)</aside>

<aside aria-labelledby="_idHeading\1" class="dark">
    <div class="dark">
        \2
    </div>
</aside>
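A minimal Python sketch of the same find and replace, assuming the markup is held in a string. The sample html value is made up for illustration, and the replacement is written on one line rather than indented.

```python
import re

# Hypothetical sample input; the aside body spans several lines.
html = '''<aside aria-labelledby="_idHeading12" class="dark">
<p>First line</p>
<p>Second line</p>
</aside>'''

# [\s\S]*? matches any character, newlines included, as few times as possible.
pattern = re.compile(
    r'<aside aria-labelledby="_idHeading(\d+)" class="dark">([\s\S]*?)</aside>'
)

# \1 is the heading number, \2 is the aside body.
replacement = (
    r'<aside aria-labelledby="_idHeading\1" class="dark">'
    r'<div class="dark">\2</div>'
    r'</aside>'
)

wrapped = pattern.sub(replacement, html)
print(wrapped)
```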

[\xC0-\xFF]+ accented characters (accents, umlauts etc.); useful for finding foreign-language text. The range also includes × and ÷
[\x{00C0}-\x{017E}]+ searches a wider range of accented Latin letters in Unicode (up to ž)
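A short Python sketch of both ranges. Python's re module does not accept the \x{00C0} form, so the wider range is written with \u escapes here; the sample text is made up.

```python
import re

# "Café, naïve, Zürich and Dvořák", written with escapes so the
# accented letters are unambiguous single code points.
text = "Caf\u00e9, na\u00efve, Z\u00fcrich and Dvo\u0159\u00e1k"

# Latin-1 accented range only.
latin1 = re.findall(r'[\xC0-\xFF]+', text)

# Wider range, U+00C0 to U+017E; this also catches the r with caron (U+0159).
extended = re.findall(r'[\u00C0-\u017E]+', text)

print(latin1)
print(extended)
```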

Lookaheads and -behinds

Find between:
(?<=Bob is)(.*?)(?= a tool) - will find " not" (including the leading space) in "Bob is not a tool."
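A quick way to check this in Python (a sketch; the second pattern, with the space moved into the lookbehind, is an added variation):

```python
import re

sentence = "Bob is not a tool."

# The lookbehind and lookahead must be present but are not part of the match.
with_space = re.search(r'(?<=Bob is)(.*?)(?= a tool)', sentence).group(1)

# Putting the space inside the lookbehind drops it from the result.
without_space = re.search(r'(?<=Bob is )(.*?)(?= a tool)', sentence).group(1)

print(repr(with_space))
print(repr(without_space))
```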

Find everything after > but before </h2>
([^>]+)(?=<\/h2)

Find the p in <p or p>
(?<=\<)p|p(?=\>)
- it finds a p preceded by a < or a p followed by a >
- will find both p's in <p class="toc-2">The Real Cost of Growing Food</p>

Find pattern not beginning with XXX
(?<!XXX)(.*?) - the match is skipped wherever it is immediately preceded by XXX

Find only digits followed by a comma, a < or an en dash (good for indexes)
(\d+)(?=[,<–])
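A small Python sketch of the index pattern. The sample index line is invented; note that 2024 is skipped because it is followed by a space.

```python
import re

# Hypothetical index line: page numbers, a page range, and a closing tag.
index_entry = "apples, 12, 45–47, 103</p> see 2024 edition"

# Digits count only when the next character is a comma, < or en dash.
pages = re.findall(r'(\d+)(?=[,<–])', index_entry)
print(pages)
```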

Not, Or...

[^a-z] (caret = not)
(ABC|DEF) (| = or; use a group, because inside [ ] the | is just a literal character)
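A short Python sketch of "not" and "or", including what happens when | is put inside square brackets (the sample strings are made up):

```python
import re

# [^a-z]+ : runs of characters that are not lowercase letters.
not_lower = re.findall(r'[^a-z]+', "abc123def")

# ABC|DEF : either whole alternative.
either = re.findall(r'ABC|DEF', "ABC def DEF")

# Inside a character class, | is literal: this matches single characters
# out of A, B, C, |, D, E, F.
in_class = re.findall(r'[ABC|DEF]', "A|Z")

print(not_lower, either, in_class)
```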

Optional Characters

? (trailing ? = previous character is optional)
- https? finds http or https
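The same check in Python, as a sketch with made-up addresses:

```python
import re

text = "http://a.example https://b.example ftp://c.example"

# The s is optional, so both schemes are found; ftp is not.
schemes = re.findall(r'https?', text)
print(schemes)
```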

Everything but

]>BREAK

- [^>]: any characters that are not a > (to allow for other classes and/or spaces)

non-capturing group

(?:xxx) (?: = don't capture)
- (?:Mr\. )(\w+) will only capture Jones in Mr. Jones

(?:xxx)? = a trailing ? makes the non-capturing group optional (zero or one occurrence) - eg: <p(?: class=".*?")?> has an optional class after p, so it matches both <p> and <p class="...">
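A Python sketch of both uses of a non-capturing group. The Mr. pattern here is one possible way to write it, and the test strings are invented.

```python
import re

# The title is matched but not stored; only the name is a group.
m = re.search(r'(?:Mr\. )(\w+)', "Dear Mr. Jones,")
print(m.groups())

# Optional non-capturing group: the class attribute may or may not be there.
tag = re.compile(r'<p(?: class=".*?")?>')
results = [bool(tag.fullmatch(s)) for s in ['<p>', '<p class="toc-2">', '<pre>']]
print(results)
```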

Basic Commands

. (Any Character Except New Line)
\d (Digit (0-9))
\D (Not a Digit (0-9))
\w (Word Character (a-z, A-Z, 0-9, _))
\W (Not a Word Character)
\s (Whitespace (space, tab, newline))
\S (Not Whitespace (space, tab, newline))
\b (Word Boundary front or back)
\B (Not a Word Boundary)
^ (Beginning of a String)
$ (End of a String)
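A few of these in Python, as a sketch with made-up sample strings:

```python
import re

line = "Order 66 shipped on 2024-05-01"

numbers = re.findall(r'\d+', line)                         # runs of digits
words = re.findall(r'\w+', "a-b c_d")                      # _ counts as a word character
chunks = re.findall(r'\S+', "a b\tc")                      # split on any whitespace
whole_word = re.findall(r'\bcat\b', "cat concat cat_lady")  # standalone "cat" only
starts = bool(re.search(r'^Order', line))                  # anchored at the start
ends = bool(re.search(r'01$', line))                       # anchored at the end

print(numbers, words, chunks, whole_word, starts, ends)
```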

Character sets

[] (Matches Characters in brackets)
[^ ] (Matches Characters NOT in brackets)
| (Either Or)
( ) (Group)

Unicode

\p{...} searches by Unicode property (not supported by Python's built-in re module)
- \p{P} finds all punctuation, including dashes etc
- \p{Lu} finds all uppercase letters

Quantifiers:

* (0 or More)
+ (1 or More)
? (0 or One)
{3} - Exact Number
{3,4} - Range of Numbers (Minimum, Maximum)
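The quantifiers side by side in Python (a sketch; the sample strings are made up):

```python
import re

s = "a ab abbb"
star = re.findall(r'ab*', s)      # a followed by zero or more b
plus = re.findall(r'ab+', s)      # a followed by one or more b
opt = re.findall(r'ab?', s)       # a followed by at most one b

digits = "12 345 6789"
exact = re.findall(r'\d{3}', digits)     # exactly three digits
ranged = re.findall(r'\d{3,4}', digits)  # three or four digits

print(star, plus, opt, exact, ranged)
```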

Python

re.compile method - finditer

The r prefix makes a raw string, so backslashes are kept literally. To print the actual characters \t instead of a tab:
print(r'\tTab')
re.compile allows Python to store patterns in variables:
pattern = re.compile(r'...')
eg:
pattern = re.compile(r'regexstring')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

returns: <re.Match object; span=(1, 4), match='abc'>
Allows string slicing using the span indexes: 1 = start; 4 = end
- print(text_to_search[1:4])
- (See string slicing [https://youtu.be/ajrtAuDg3yw])
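Put together as a runnable sketch; the text_to_search value is made up so that the first match lands at span (1, 4).

```python
import re

text_to_search = "xabcx abc"
pattern = re.compile(r'abc')

spans = []
for match in pattern.finditer(text_to_search):
    start, end = match.span()
    spans.append((start, end))
    # Slicing with the span gives back the matched text.
    print(match.span(), text_to_search[start:end])
```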

to select group (a particular set in parentheses) use:
print(match.group(0))
group 0 is entire match
group 1 is first set of parentheses
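A small sketch of group numbering, using an invented date string:

```python
import re

m = re.search(r'(\d{4})-(\d{2})', "Released 2024-05")

whole = m.group(0)   # entire match
year = m.group(1)    # first set of parentheses
month = m.group(2)   # second set of parentheses
print(whole, year, month)
```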

Samples

urls = '''
http://bob.com
http://www.apple.ca
https://www.apple.com
https://appsforus.net
'''

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
- the dots are escaped so they match a literal . rather than any character
- finds all urls and makes 3 groups: www., domain, top-level domain

subbed_urls = pattern.sub(r'\2\3', urls)
- replaces each matched url with groups 2 & 3 (domain, top-level domain)
- http://bob.com --> bob.com
- http://www.apple.ca --> apple.ca
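The whole sample as a runnable sketch. It assumes the first url is http://bob.com and that the dots in the pattern are escaped; without those two changes the results differ.

```python
import re

urls = '''
http://bob.com
http://www.apple.ca
https://www.apple.com
https://appsforus.net
'''

# Group 1: optional www., group 2: domain, group 3: top-level domain.
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

# Replace each url with just its domain and top-level domain.
subbed_urls = pattern.sub(r'\2\3', urls)
print(subbed_urls)
```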

re.compile method - findall method

matches = pattern.findall(text_to_search)
- just returns matches as list of strings
- multiple groups returns list of tuples
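Both return shapes in a short sketch (sample text invented):

```python
import re

text_to_search = "a1 b22"

# No groups: a list of strings.
plain = re.compile(r'\d+').findall(text_to_search)

# Two groups: a list of tuples, one tuple per match.
grouped = re.compile(r'(\w)(\d+)').findall(text_to_search)

print(plain, grouped)
```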

re.compile method - match method

matches = pattern.match(text_to_search)
- just returns the first match, and only if it is at the beginning of the string; not iterable (no loops)
- returns a match object like the ones from finditer, or None if there is no match

re.compile method - search method

matches = pattern.search(text_to_search)
- just returns the first match anywhere in the string; not iterable (no loops)
- returns a match object like the ones from finditer, or None if there is no match
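A sketch of the difference between match and search, using a made-up string where the digits are not at the start:

```python
import re

pattern = re.compile(r'\d+')

# match only looks at the beginning of the string.
at_start = pattern.match("abc 123")

# search scans forward until it finds the first match.
anywhere = pattern.search("abc 123")

print(at_start)
print(anywhere.group(0), anywhere.span())
```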

Flags

re.IGNORECASE
- pattern = re.compile(r'start', re.IGNORECASE)
- shorthand is re.I

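re.MULTILINE
- makes ^ and $ match at the start and end of every line, not just of the whole string
- shorthand is re.M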
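Both flags in a short sketch (sample strings are made up):

```python
import re

# IGNORECASE: one pattern matches every capitalisation.
any_case = re.compile(r'start', re.IGNORECASE).findall("Start START start")

lines = "one\ntwo\nthree"

# Without MULTILINE, ^ only matches at the very start of the string.
first_only = re.findall(r'^\w+', lines)

# With MULTILINE, ^ matches at the start of each line.
every_line = re.findall(r'^\w+', lines, re.M)

print(any_case, first_only, every_line)
```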
