Regex cheat sheet for Python
I am writing a program that will find all the sex and sire values in a raw data file and get rid of characters we don’t need anymore.
Character Classes
Character classes are shorthand escapes, written with a backslash (\), that each match a whole category of characters.
For instance, instead of writing [A-Za-z0-9_] you can just write \w, and that will match any word character. There are a handful of them:
| Escape character | Use | Example |
|---|---|---|
| \w | Word characters | a-z, A-Z, 0-9, _ |
| \W | Non-word characters | !@#$%^&*() |
| \d | Digits | 0-9 |
| \D | Non-digits | Anything that is not 0-9 |
| \s | Whitespace | space, \t, \n, \r |
| \S | Non-whitespace | Opposite of above |
| \b | Word boundary | The position at the start or end of a word; \bcat\b matches "cat" but not "cats" |
| \B | Not a word boundary | A position inside a word; \Bcat matches the "cat" in "concat" |
The lowercase and uppercase versions are exact opposites: \d matches any digit (0-9) and \D matches anything that is not a digit.
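
To see the character classes in action, here is a minimal sketch using Python's built-in re module. The sample strings and variable names are made up for illustration and are not from my data file.

```python
import re

# Made-up sample text; "\t" here is a real tab character.
text = "Order 66 shipped on 2024-01-15\tto Jane_Doe!"

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of word characters (letters, digits, _)
spaces = re.findall(r"\s", text)    # single whitespace characters

# \D is the opposite of \d, so this strips everything that is not a digit.
digits_only = re.sub(r"\D", "", "2024-01-15")

sample = "cat concat cats"
whole_word = re.findall(r"\bcat\b", sample)      # "cat" as a whole word only
word_starts = re.findall(r"\bcat\w*", sample)    # words that start with "cat"
# \B requires that we are NOT at a word boundary, so only the "cat"
# inside "concat" qualifies; record where it starts.
inside_positions = [m.start() for m in re.finditer(r"\Bcat", sample)]

print(digits, words, spaces, digits_only)
print(whole_word, word_starts, inside_positions)
```

Raw strings (the r"..." prefix) keep Python from interpreting the backslashes before the regex engine sees them.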
Special characters
Special characters are where things get helpful. I don’t want to type out \w\w\w\w to find every 4-letter word. Seems silly.
| Special Character | Use | Example |
|---|---|---|
| ^ | Beginning of a string | ^\d matches a string that starts with a digit |
| $ | End of a string; whatever is to its left must sit at the very end | \w$ finds when I don’t end a line with punctuation |
| . | Any character except \n | |
| \ | Escape character | |
| \| | OR | A\|B matches A or B |
| {N} | Exact number of times the thing to its left needs to be found | \w{4} finds a word character 4 times in a row |
| {N,} | Same as above, but N or more times | \w{4,} finds 4 or more word characters in a row |
| {N,X} | Between N and X times | \w{4,6} finds 4 to 6 word characters in a row |
| * | 0 or more times | \w* finds a word character 0 or more times |
| + | 1 or more times | \d+ finds a digit 1 or more times |
| ? | Optional (0 or 1 times). Like the () around the area code: sometimes they’re there, other times they aren’t | \(?\d{3}\)? finds the area code with optional () |
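
Here is a rough sketch of the special characters above using Python's re module. The sentences and phone numbers are invented for the example. Note that \w{4} on its own matches any four word characters in a row, even inside a longer word, so the sketch assumes you add \b on both sides when you want whole four-letter words.

```python
import re

sentence = "I am writing a tiny regex cheat sheet"

four_letter = re.findall(r"\b\w{4}\b", sentence)     # exactly 4 word characters
five_plus = re.findall(r"\b\w{5,}\b", sentence)      # 5 or more
four_to_five = re.findall(r"\b\w{4,5}\b", sentence)  # between 4 and 5

# ^ anchors at the start of the string, $ at the end.
starts_with_digit = re.search(r"^\d", "42 sheep") is not None
ends_without_punct = re.search(r"\w$", "no punctuation here") is not None
ends_with_punct = re.search(r"\w$", "punctuation here.") is not None

# . matches any character except a newline; \. matches a literal period.
dot_matches = re.findall(r"c.t", "cat cot c\nt")
literal_dots = re.findall(r"\.", "a.b.c")

# | means OR.
pets = re.findall(r"cat|dog", "cat, dog, bird")

# ? makes the parentheses around the area code optional.
area_codes = [
    re.match(r"\(?\d{3}\)?", number).group()
    for number in ["(555) 123-4567", "555 123-4567"]  # made-up numbers
]

print(four_letter, five_plus, four_to_five)
print(starts_with_digit, ends_without_punct, ends_with_punct)
print(dot_matches, literal_dots, pets, area_codes)
```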
Sets
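Sets use square brackets and match any one character from the group inside them. An example would be [A-Z]. That would match any single uppercase letter A-Z.
You also don’t need to repeat letters, because a set only lists which characters are allowed; put a + after it so any of them can appear 1 or more times. For instance, a set that covers every character in “James Carney” would look like this:
[jamescrny\s]+
Carney looks a little weird because it has some letters that appear in James, so they don’t need to be repeated. I also added \s because there is a space between James and Carney. The set is all lowercase, so either ignore case (re.IGNORECASE in Python) or add J and C to catch the capitals. Keep in mind that it matches any run of those characters, not just that one name.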
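
A small sketch of how that set behaves in Python. It assumes the set gets a + quantifier and the re.IGNORECASE flag so the capital J and C count; the other names are made up to show where the set is looser or stricter than you might expect.

```python
import re

# The set from above, with + (one or more) and case ignored.
name_chars = re.compile(r"[jamescrny\s]+", re.IGNORECASE)

# fullmatch succeeds only if the WHOLE string is built from the set.
james_ok = name_chars.fullmatch("James Carney") is not None
jane_ok = name_chars.fullmatch("Jane Carney") is not None    # same letters, different name
carter_ok = name_chars.fullmatch("James Carter") is not None  # "t" is not in the set

# Without + and without ignoring case, the set matches one character
# at a time and skips the capital J.
single_chars = re.findall(r"[jamescrny\s]", "James")

# If you only ever want that exact name, spelling it out is stricter.
exact = re.search(r"James\s+Carney", "Sire: James Carney") is not None

print(james_ok, jane_ok, carter_ok, single_chars, exact)
```

So a set is handy for "only these characters are allowed", while a literal pattern is the safer choice for one specific name.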
| Set layout | Purpose |
|---|---|
| [ ] | Regular set. Matches any one character listed between the brackets |
| [milk] | Matches any one of m, i, l, or k, not the whole word “milk” |
| [a-z] | Matches any one lowercase letter a-z |
| [-a] | Matches - or ‘a’; the - is literal because it is at the beginning of the set |
| [^anything] | The ^ at the start negates the set: matches any one character not in the brackets |
| []?*+ | A set followed by a special character (?, * or +) sets how many characters from the set you’re looking for |
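
The set layouts in the table can be tried out with a short Python sketch like this one. The raw value "Sire: Big-Red #12!" is a hypothetical stand-in for a messy field, not a line from my real file, and the choice of which characters to keep is just an assumption for the example.

```python
import re

any_of = re.findall(r"[milk]", "milkshake")         # each m, i, l or k, one at a time
lower_runs = re.findall(r"[a-z]+", "Hello World")   # runs of lowercase letters
dash_or_a = re.findall(r"[-a]", "a-b-c")            # leading - is a literal dash
not_digits = re.findall(r"[^0-9]+", "abc123def")    # ^ negates the set
digit_runs = re.findall(r"[0-9]+", "a1b22c333")     # set followed by +

# Getting rid of characters we don't need: delete everything that is
# not a letter, a digit or a space.
raw = "Sire: Big-Red #12!"  # hypothetical messy value
cleaned = re.sub(r"[^A-Za-z0-9 ]", "", raw)

print(any_of, lower_runs, dash_or_a, not_digits, digit_runs)
print(cleaned)
```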