Regex cheat sheet for Python

I am writing a program that will find all the sex and sire entries in a raw data file and get rid of the characters we don’t need anymore.

Character Classes

Character classes are a small group of shorthand codes, each written as an escape \ followed by a letter. Each one catches a whole category of characters at once.

For instance, instead of writing [A-Za-z0-9_] you can just use \w and that will catch all the different word characters. There are a handful of them:

Escape character	Use	Example
`\w`	Word Characters	a-z, A-Z, 0-9, _
`\W`	Non-Word Characters	!@#$%^&*() and spaces
`\d`	Digits	0-9
`\D`	Non Digits	Not 0-9
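`\s`	White Spaces	space, `\t, \n, \r, \f, \v`
`\S`	Non white spaces	Anything `\s` does not match
`\b`	Boundary of a word	The position between a word character and a non-word character (the start or end of a word). It matches a position, not a letter
`\B`	Opposite	Any position that is not a word boundary, such as between two letters inside a word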

The lowercase and uppercase versions are exact opposites. \d finds any digit (0-9) and \D finds any character that is not a digit.
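Here is a quick sketch for checking the classes above against Python’s `re` module. The sample strings are made up for illustration and are not from my real data file.

```python
import re

# Hypothetical sample text, just for trying the classes out
text = "Cow_42 sold for $1,250 on 3/14"

# \w+ grabs runs of letters, digits, and underscores
words = re.findall(r"\w+", text)
print(words)    # ['Cow_42', 'sold', 'for', '1', '250', 'on', '3', '14']

# \d+ grabs runs of digits
digits = re.findall(r"\d+", text)
print(digits)   # ['42', '1', '250', '3', '14']

# \s+ splits on any run of whitespace (spaces, tabs, newlines)
fields = re.split(r"\s+", "bull\tAngus\n2019")
print(fields)   # ['bull', 'Angus', '2019']

# \b only matches at the edge of a word, \B only inside one
edge = [m.start() for m in re.finditer(r"\bso", "sold also")]
inside = [m.start() for m in re.finditer(r"\Bso", "sold also")]
print(edge, inside)   # [0] [7]
```

Raw strings (the `r"..."` prefix) keep Python from treating the backslashes as its own escapes before the regex engine sees them.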

Special characters

Special characters are where things get helpful. I don’t want to type out \w\w\w\w to find every 4-letter word. Seems silly.

Special Character	Use	Example
`^`	Beginning of a string	`^\d`
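`$`	End of a string. Whatever is to its left has to sit right at the end	`\w$` finds when I don’t end a line with punctuation
`.`	Anything besides `\n`	`a.c` finds “abc” and “a-c”
`\`	Escape character. Makes a special character literal	`\.` finds an actual period
`|`	OR	`A|B` A OR B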
`{N}`	Exact number of times the thing to its left needs to be found	`\w{4}` Finds a word character 4 times in a row
`{N,}`	Same as above, but N or more times	`\w{4,}` Finds 4 or more word characters in a row
`{N,X}`	Finds the thing to its left N through X times	`\w{4,6}` Finds 4 to 6 word characters in a row
`*`	Finds something 0 or more times	`\w*` Finds a word character 0 or more times
`+`	Finds something 1 or more times	`\d+` Finds a digit 1 or more times
`?`	Optional: the thing to its left shows up 0 or 1 times. Like the () around the area code. Sometimes they’re there, other times they aren’t	`\(?\d{3}\)?` Finds the area code with optional ()
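To see the special characters in action, here is a minimal sketch using Python’s `re` module. The phone numbers and sentences are invented examples, and the variable names are just for illustration.

```python
import re

# ? makes the escaped parentheses optional around the area code
area_code = re.compile(r"^\(?(\d{3})\)?")
phones = ["(555) 123-4567", "555 123-4567"]   # made-up numbers
codes = [area_code.match(p).group(1) for p in phones]
print(codes)          # ['555', '555']

# {N}, {N,} and {N,X} control how many word characters are needed
text = "a bee flew over seven gardens"
exactly_4 = re.findall(r"\b\w{4}\b", text)
at_least_4 = re.findall(r"\b\w{4,}\b", text)
from_4_to_6 = re.findall(r"\b\w{4,6}\b", text)
print(exactly_4)      # ['flew', 'over']
print(at_least_4)     # ['flew', 'over', 'seven', 'gardens']
print(from_4_to_6)    # ['flew', 'over', 'seven']

# | picks one alternative or the other
pets = re.findall(r"cat|dog", "cat, dog, bird")
print(pets)           # ['cat', 'dog']

# ^ anchors to the start of the string, $ to the end
starts_with_digit = bool(re.search(r"^\d", "4 cows"))
no_end_punct = bool(re.search(r"\w$", "no punctuation"))
has_end_punct = bool(re.search(r"\w$", "ends with a period."))
print(starts_with_digit, no_end_punct, has_end_punct)   # True True False
```

The parentheses are escaped as `\(` and `\)` because unescaped parentheses create a group instead of matching the actual characters.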

Sets

Sets use brackets and match one character out of a group of characters. An example would be [A-Z]. That would match any single uppercase letter A-Z.

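You also don’t need to repeat letters inside a set, because a set only says which characters are allowed, not how many times or in what order they appear. A set on its own matches a single character, so to cover a whole name it needs a quantifier such as +. For instance, if I wanted to cover the lowercase letters and the space in “James Carney” it would look like this:

[jamescrny\s]+

Carney looks a little weird because it has some letters that appear in James so they don’t need to be repeated. I also added \s because there is a space between James and Carney. The capital J and C are not in the set, so add them (or use re.IGNORECASE) to match the name as written. The set also matches any other run of those characters, such as “yes man”, so use the literal pattern James\sCarney when only that exact name should match.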
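A quick way to check what this set really matches is to run it through `re.findall`. This is only a sketch: the `+` quantifier, the `re.IGNORECASE` flag, and the “yes man” test string are my own additions for illustration.

```python
import re

name = "James Carney"

# The bare set matches one allowed character at a time
one_at_a_time = re.findall(r"[jamescrny\s]", name)
print(one_at_a_time)  # ['a', 'm', 'e', 's', ' ', 'a', 'r', 'n', 'e', 'y']

# With + it matches runs, but the capital J and C break the run
runs = re.findall(r"[jamescrny\s]+", name)
print(runs)           # ['ames ', 'arney']

# Ignoring case lets the set cover the whole name
whole = re.findall(r"[jamescrny\s]+", name, flags=re.IGNORECASE)
print(whole)          # ['James Carney']

# The set accepts any mix of its characters, not just the name
other = re.fullmatch(r"[jamescrny\s]+", "yes man") is not None
exact = re.fullmatch(r"James\sCarney", "yes man") is not None
print(other, exact)   # True False
```

So a set is good for saying which characters are allowed, while a literal pattern like `James\sCarney` is the safer choice for one specific name.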

Set layout	Purpose
[ ]	Regular Set. Finds any one character listed between the brackets
[milk]	Finds any one of the letters m, i, l, or k, not the whole word “milk”
[a-z]	Finds any one lowercase letter a-z
[-a]	Finds `-` or ‘a’ since the `-` is at the beginning of the set
[^anything]	A `^` at the start of the set flips it: finds any one character that is not in the bracket
[]?*+	A set followed by a special character. The special character determines how many characters from the set you’re looking for
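Each row of the set table can be tried out with a one-liner. This is a small sketch with made-up test strings; the variable names are only for illustration.

```python
import re

# [milk] matches any single one of those letters
any_of = re.findall(r"[milk]", "lamb")
print(any_of)        # ['l', 'm']

# [a-z]+ matches runs of lowercase letters only
lower_runs = re.findall(r"[a-z]+", "Tag 12 ab")
print(lower_runs)    # ['ag', 'ab']

# A leading - inside the set is a literal hyphen
dash_or_a = re.findall(r"[-a]", "a-b")
print(dash_or_a)     # ['a', '-']

# A leading ^ negates the set: everything except digits
not_digits = re.findall(r"[^0-9]+", "ab12cd")
print(not_digits)    # ['ab', 'cd']

# A set followed by ? or + controls how many characters it takes
capitalized = re.fullmatch(r"[A-Z]?[a-z]+", "Carney") is not None
lowercase = re.fullmatch(r"[A-Z]?[a-z]+", "carney") is not None
print(capitalized, lowercase)   # True True
```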