Regular expressions
Latest revision as of 22:27, 8 July 2024
A regular expression, often shortened to regex, specifies a match pattern using only text.
Syntax
In the Symbol(s) column, x, y, and z are placeholders for text. Capital X, Y, and Z are placeholders for numbers.
Symbol(s) | Name | Description | Example |
---|---|---|---|
Groups and backreferences | | | |
(x) | Capture group | Separates the content in the output. | "Foo Bar" /(Foo)|(Bar)/g -> [ "Foo", "Bar" ] |
(?:x) | Non-capture group | Groups the content without separating it in the output. | "Foo Bar" /(?:Foo)|(?:Bar)/g -> [ "Foo", "Bar" ] |
(?<y>x) | Named capture group | Equivalent to (x), except the captured content can be referred to by the name y. | "Foo Bar" /(?<F>Foo)|(?<B>Bar)/g -> [ "Foo", "Bar" ] |
\k<y> | Named backreference | References a previous named capture group. Note that \k is literal. | "Foo Foo" /(?<Foo>Foo)\s\k<Foo>/g -> [ "Foo Foo" ] |
Character classes | | | |
[x-z] | Character class | Matches every letter or number from x to z. | "Foo Bar" /[a-f]/gi -> [ "F", "B", "a" ] |
[xyz] | Character class | Matches either x, y, or z. | "Foo Bar" /[FB]/g -> [ "F", "B" ] |
[^x-z] | Negated character class | Matches every character that is not from x to z. | "Foo Bar" /[^a-f]/gi -> [ "o", "o", " ", "r" ] |
[^xyz] | Negated character class | Matches every character that isn't x, y, or z. | "Foo Bar" /[^FB]/g -> [ "o", "o", " ", "a", "r" ] |
. | Wildcard | Matches every character besides line terminators. Line terminators include \n, \r, \u2028, and \u2029. | "Foo Bar" /./g -> [ "F", "o", "o", " ", "B", "a", "r" ] |
x|y | Disjunction | Matches either x or y. | "Foo Bar" /Foo|Bar/g -> [ "Foo", "Bar" ] |
\ | Escape character | Makes a character that is reserved for regex, such as *, |, or ., match literally. Note that \ is itself a reserved character, so to match it, you need to use \\. | "Foo.bar apple 78.9 banana" /[A-Za-z0-9]*\.[A-Za-z0-9]*/g -> [ "Foo.bar", "78.9" ] |
\d | Digit character class | Equivalent to [0-9] | "78 Foo Bars" /\d/g -> [ "7", "8" ] |
\D | Non-digit character class | Equivalent to [^0-9] | "78 Foo Bars" /\D/g -> [ " ", "F", "o", "o", " ", "B", "a", "r", "s" ] |
\w | Word character class | Equivalent to [A-Za-z0-9_] | "_Foo- Bars+" /\w/g -> [ "_", "F", "o", "o", "B", "a", "r", "s" ] |
\W | Non-word character class | Equivalent to [^A-Za-z0-9_] | "_Foo- Bars+" /\W/g -> [ "-", " ", "+" ] |
\s | White space character class | Matches all whitespace characters. Equivalent to [\f\n\r\t\v\u0020\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff] | "_Foo- Bars+" /\s/g -> [ " " ] |
\S | Non-white space character class | Matches everything but whitespace characters. Equivalent to [^\f\n\r\t\v\u0020\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff] | "_Foo- Bars+" /\S/g -> [ "_", "F", "o", "o", "-", "B", "a", "r", "s", "+" ] |
\t | Horizontal tab | Matches horizontal tab characters. | "a\tb" /\t/g -> [ "\t" ] |
\n | New line | Matches linefeed/new line characters | "a\nb" /(?:\r?\n)|(?:\v)|(?:\f)/g -> [ "\n" ] |
\r | Carriage return | Matches carriage return characters | "a\nb" /(?:\r?\n)|(?:\v)|(?:\f)/g -> [ "\n" ] |
\v | Vertical tab | Matches vertical tab characters | "a\nb" /(?:\r?\n)|(?:\v)|(?:\f)/g -> [ "\n" ] |
\f | Form feed | Matches form feed characters | "a\nb" /(?:\r?\n)|(?:\v)|(?:\f)/g -> [ "\n" ] |
[\b] | Backspace | Matches backspace | No example can be provided |
\0 | NUL | Matches the NUL character | No example can be provided |
\u{YYYY} | Unicode value escape | When the u flag is applied, matches the character with the given Unicode code point. Here Y represents a hexadecimal digit. | No example can be provided |
\uYYYY | Unicode value escape | Matches the UTF-16 code unit with the given hexadecimal value. Here Y represents a hexadecimal digit. | No example can be provided |
\p{x} or \P{x} | Unicode character class | Matches a character based on the Unicode property (x). \P{x} matches characters that lack the property. Requires the u or v flag. | No example can be provided |
\cx | Caret notation escape | Matches the control character written as x in caret notation. Here x is a single letter from A to Z. | "a\r\nb" /\cM\cJ/g -> [ "\r\n" ] |
Assertions | | | |
^ | Input boundary beginning | Matches the beginning of the input. If the m flag is on, it also matches the start of each line. | "Foo Bar" /(^Foo)|(Bar$)/g -> [ "Foo", "Bar" ] |
$ | Input boundary end | Matches the end of the input. If the m flag is on, it also matches the end of each line. | "Foo Bar" /(^Foo)|(Bar$)/g -> [ "Foo", "Bar" ] |
\b | Word boundary | Matches either end of a word. | "Foo Bar" /(\bFoo\b)/ -> [ "Foo" ] |
\B | Non-word boundary | Matches the middle of a word. | "Foo Bar" /(B\Bar)/ -> [ "Bar" ] |
x(?=y) | Positive lookahead | Matches x if y is after it, but doesn't include y in the output. | "Foo Bar" /Foo(?= Bar)/ -> [ "Foo" ] |
x(?!y) | Negative lookahead | Matches x if y is not after it. | "Foo Bar" /Foo(?! Car)/ -> [ "Foo" ] |
(?<=x)y | Positive lookbehind | Matches y if x is before it, but doesn't include x in the output. | "Foo Bar" /(?<=Foo )Bar/ -> [ "Bar" ] |
(?<!x)y | Negative lookbehind | Matches y if x is not before it. | "Foo Bar" /(?<!Moo )Bar/ -> [ "Bar" ] |
Quantifiers | | | |
x* | Wild-amount | Matches x any number of times, including 0. | "Foo Foo Foo Bar" /(?:Foo )*Bar/g -> [ "Foo Foo Foo Bar" ] |
x+ | Wild-1-or-more | Matches x if it occurs 1 or more times. | "Foo Bar Bar" /(Foo)+ (Bar)+/ -> [ "Foo Bar Bar" ] |
x? | Can occur | Matches x if it occurs, otherwise, ignore it. | "Foo " /Foo (Bar)?/ -> [ "Foo " ] |
x{Y} | Occurs set times | Matches if x occurs exactly Y times. | "Foo Bar Bar" /Fo{2} (?:Bar\s?){2}/ -> [ "Foo Bar Bar" ] |
x{Y,Z} | Occurs between set times | Matches if x occurs between Y and Z times. | "Foooo Bar Bar Bar Bar Bar" /Fo{2,5} (?:Bar\s?){1,10}/ -> [ "Foooo Bar Bar Bar Bar Bar" ] |
x*?, x+?, x??, x{Y}?, or x{Y,Z}? | Lazy match | Matches x the fewest times possible, in accordance with the base rule. | "Foooo Bar Bar Bar Bar Bar" /Fo{2,5} (?:Bar\s??){1,10}?/ -> [ "Foooo Bar" ] |
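
A few of the rows above can be tried directly in JavaScript, the language whose regex syntax MDN documents. This is a minimal sketch rather than PenguinMod's own implementation: it assumes the blocks behave like JavaScript's built-in `String.prototype.match`, and every variable name is made up for illustration.

```javascript
// Assumption: PenguinMod regex blocks follow JavaScript RegExp semantics.
const text = "Foo Bar";

// Character class with the g and i flags: every letter from a to f, any case.
const classMatches = text.match(/[a-f]/gi); // [ "F", "B", "a" ]

// Positive lookbehind: "Bar" only when "Foo " comes right before it.
const lookbehindMatches = text.match(/(?<=Foo )Bar/g); // [ "Bar" ]

// Named capture group plus named backreference: the same word twice.
const repeated = "Foo Foo".match(/(?<word>Foo)\s\k<word>/);
const repeatedWord = repeated.groups.word; // "Foo"

// Greedy versus lazy quantifiers on the same input.
const sample = "Foooo Bar Bar Bar";
const greedy = sample.match(/Fo{2,5} (?:Bar\s?){1,10}/)[0]; // "Foooo Bar Bar Bar"
const lazy = sample.match(/Fo{2,5} (?:Bar\s??){1,10}?/)[0]; // "Foooo Bar"

console.log(classMatches, lookbehindMatches, repeatedWord, greedy, lazy);
```

The greedy pattern takes as many repetitions of `Bar` as it can, while the lazy one stops after the first repetition that lets the whole pattern succeed.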
Flags
Whilst there are flags other than the following, they are either non-standard or have no bearing on PenguinMod.
Flag | Name | Description |
---|---|---|
g | global | Searches all of a string, rather than stopping at the first occurrence. |
i | Case insensitive | The search ignores the case of characters, making /[A-Za-z]/g and /[a-z]/gi equivalent. |
m | multiline | Makes ^ and $ match the start and end of each line, not just the start and end of the string. |
s | single line/dot all | Makes . able to match all line terminators: \n, \r, \u2028, and \u2029. |
u | unicode | Treats the pattern as a sequence of Unicode code points. |
v | unicode upgrade | Similar to u, but updated with more features. |
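
The effect of each flag is easiest to see by running the same pattern with and without it. The sketch below uses plain JavaScript and assumes PenguinMod passes flags through to a JavaScript RegExp unchanged, which this page does not confirm. The variable names are illustrative only.

```javascript
// Assumption: flags behave as they do on a JavaScript RegExp.
const lines = "foo\nFoo";

// g: without it only the first occurrence is found.
const withoutG = lines.match(/o/).length;  // 1 (a single match)
const withG = lines.match(/o/g).length;    // 4 (every "o")

// i: ignore case.
const caseSensitive = lines.match(/foo/g);    // [ "foo" ]
const caseInsensitive = lines.match(/foo/gi); // [ "foo", "Foo" ]

// m: ^ also matches at the start of each line.
const stringStartOnly = lines.match(/^foo/gi); // [ "foo" ]
const everyLineStart = lines.match(/^foo/gim); // [ "foo", "Foo" ]

// s: . also matches line terminators.
const dotWithoutS = /foo.Foo/.test(lines); // false
const dotWithS = /foo.Foo/s.test(lines);   // true

// u: a character outside the Basic Multilingual Plane counts as one
// code point instead of two UTF-16 code units.
const face = "\u{1F600}";
const unitsWithoutU = face.match(/./g).length; // 2
const pointsWithU = face.match(/./gu).length;  // 1
```

Flags can be combined in any order, as `gim` is here.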
See also
(match [foo bar] with regex [foo] [g]:: operators)
<test regex [foo bar] [g] with text [foo]:: sensing>
External links
- Regular expressions on Wikipedia
- TurboWarp extension gallery featuring TrueFantom's RegExp extension. It can be loaded into PenguinMod using https://extensions.turbowarp.org/true-fantom/regexp.js as the URL in the Load Custom Extensions popup. It adds more regex functionality to PenguinMod.
- regex101, a fairly useful little app with some fun challenges to test your knowledge of regex.
- MDN's Regular expressions documentation for JavaScript. There wasn't a good place to cite this, but a lot of this page was sourced from here, including pretty much all of the names for each syntax element.