Abstract: To me, the regular expression is a monster in two ways: hard to learn and hard to use. The concept dates back to the 1950s, a time when everything was very C-style (raw and not beautiful). Still, for one reason or another, I have to master it... and I failed every time. Well... until this time.
In this post, I'm going to put together some quick and dirty notes for you, as usual, so you can solve problems quickly and know where to look when you come back a few months later.
Result Goes First.
Our goal is to extract the phone numbers from some sample text containing phone book entries written in different formats.
What is the point then?
Regex has a lot of tricky operators. The key is not to learn them all at once, but to learn wisely by using the reference like a reverse dictionary: start from what you need, then look up the operator that does it.
- Analyze the data pattern
- Think about what you need, in terms the computer understands
- Go to the Python website and find out which operator (character) you need
It is easy to see that our phone numbers follow the pattern "three - three - four - extension".
So all that remains is handling the connectors. A connector must be something that is not a digit, and its length can be anything from 0 to infinity. Last comes the extension, which may be omitted and has an uncertain length.
And you should be able to get the results.
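The reasoning above can be sketched in a few lines. This is only a minimal sketch: the sample strings and the names `PHONE` and `parse_phone` are made up for illustration and are not taken from the original phone book. Each piece of the pattern maps to one step of the analysis: `\d{3}` and `\d{4}` are the digit groups, `\D*` is a connector of zero or more non-digits, and the final `\d*` is the extension that may be empty.

```python
import re

# three digits, connector, three digits, connector, four digits,
# connector, optional extension of any length
PHONE = re.compile(r"(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)")


def parse_phone(text):
    """Return (area, prefix, line, extension) or None if nothing matches."""
    match = PHONE.search(text)
    return match.groups() if match else None


# Hypothetical entries, each written in a different format
samples = [
    "800-555-1212",
    "800 555 1212 ext. 1234",
    "(800)5551212 x1234",
    "work 1-(800) 555.1212 #1234",
]

for entry in samples:
    print(entry, "->", parse_phone(entry))
# 800-555-1212 -> ('800', '555', '1212', '')
# 800 555 1212 ext. 1234 -> ('800', '555', '1212', '1234')
# (800)5551212 x1234 -> ('800', '555', '1212', '1234')
# work 1-(800) 555.1212 #1234 -> ('800', '555', '1212', '1234')
```

Note that this pattern is loose on purpose. A number with extra digits glued to the front and no separator would be split in the wrong place, so tighten it if your data needs that.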
Basic Indicators Groups
All you have to do is construct a pattern string out of indicators, so that the pattern matches anything of the same shape in a query string. In general, there are two types of indicators, and you should keep them crystal clear in your head when constructing the pattern.
- Type A: Content indicator that decides what character should show up in this position.
- Type B: Counting indicator that tells you how many characters should extend from this position.
I want you to get a feel for this yourself. For the [abcde to whatever] indicator, I like to write the choices vertically, to remind myself that they are only the options for the current position.
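To get a feel for the two types, here is a small sketch using made-up test strings. `[abc]` is a Type A indicator (one position, several choices), while `{2}` and `+` are Type B indicators (how many times the thing before them repeats).

```python
import re

# Type A (content): [abc] fills exactly ONE position with a, b, or c.
one_position = re.findall(r"[abc]", "cabbage")
print(one_position)   # ['c', 'a', 'b', 'b', 'a']

# Type B (counting): {2} says the previous indicator repeats exactly twice.
two_positions = re.findall(r"[abc]{2}", "cabbage")
print(two_positions)  # ['ca', 'bb']

# Type B again: + means "one or more" of the previous indicator.
digit_runs = re.findall(r"\d+", "a1b22c333")
print(digit_runs)     # ['1', '22', '333']

# Without a counting indicator, [abc] cannot cover a three-letter string.
print(re.fullmatch(r"[abc]", "cab"))             # None
print(re.fullmatch(r"[abc]+", "cab") is not None)  # True
```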
How to use it in Python
I have some shorthands to share here. It may be better to include some side notes in Chinese as well, so that Chinese readers can also follow what's going on. The reason for using re.compile is that it returns a regex object, so you don't have to pass the pattern in every time.
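A small sketch of the difference, using a made-up pattern and made-up text. The module-level functions take the pattern on every call, while the compiled object remembers it.

```python
import re

pattern = r"\d{3}\D*\d{4}"  # illustrative pattern: 3 digits, connector, 4 digits
text = "call 555-1212 or 555 3434"

# Module-level function: the pattern is passed in on every call.
first = re.search(pattern, text).group()

# Compiled object: pass the pattern once, then reuse its methods.
phone = re.compile(pattern)
same_first = phone.search(text).group()   # first match only
all_numbers = phone.findall(text)         # every non-overlapping match
hidden = phone.sub("<hidden>", text)      # replace every match

print(first)        # 555-1212
print(all_numbers)  # ['555-1212', '555 3434']
print(hidden)       # call <hidden> or <hidden>
```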
Make sure everything above is clear to you, and remember not to be too greedy with regular expressions in a single day. Otherwise, you will remember nothing.
I prepared some extended questions in case you want to do more research.
- What is a raw string, and why do you need it for your pattern string?
- What is the loose format of a pattern string?
- Being too greedy?
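If you want a head start on these questions, here is a rough sketch with made-up strings. It assumes that "loose format" refers to Python's `re.VERBOSE` flag, which is my reading and not something the questions state.

```python
import re

# 1. Raw string: r"..." keeps the backslash, so \b reaches the regex engine
#    as "word boundary" instead of becoming a backspace character.
raw_hit = re.search(r"\bcat\b", "a cat sat")
plain_hit = re.search("\bcat\b", "a cat sat")
print(raw_hit is not None, plain_hit is not None)  # True False

# 2. Loose format (assumed to mean re.VERBOSE): whitespace and # comments
#    inside the pattern are ignored, so each piece can be documented.
phone = re.compile(r"""
    (\d{3})   # first three digits
    \D*       # connector: any number of non-digits
    (\d{3})   # next three digits
    \D*       # connector
    (\d{4})   # last four digits
    \D*       # connector
    (\d*)     # extension, may be empty
""", re.VERBOSE)
parts = phone.search("800-555-1212 x42").groups()
print(parts)  # ('800', '555', '1212', '42')

# 3. Greedy vs. non-greedy: + grabs as much as it can, +? as little as it can.
greedy = re.search(r"<.+>", "<b>bold</b>").group()
lazy = re.search(r"<.+?>", "<b>bold</b>").group()
print(greedy)  # <b>bold</b>
print(lazy)    # <b>
```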
The phone book example is taken from <