Title: | Core Functionality for the 'rebus' Package |
---|---|
Description: | Build regular expressions piece by piece using human readable code. This package contains core functionality, and is primarily intended to be used by package developers. |
Authors: | Richard Cotton [aut, cre] |
Maintainer: | Richard Cotton <[email protected]> |
License: | Unlimited |
Version: | 0.0-3 |
Built: | 2024-11-11 03:13:06 UTC |
Source: | https://github.com/richierocks/rebus.base |
START matches the start of a string.
END matches the end of a string.
exactly makes the regular expression match the whole string, from start to end.
START
END
exactly(x)
x | A character vector. |
An object of class regex (inherits from character) of length 1.
A character vector representing part or all of a regular expression.
Caret and dollar are used as start/end delimiters, since \A and \Z are not supported by R's internal PCRE engine or stringi's ICU engine.
http://www.regular-expressions.info/anchors.html and http://www.rexegg.com/regex-anchors.html
START
END

# Usage
x <- c("catfish", "tomcat", "cat")
(rx_start <- START %R% "cat")
(rx_end <- "cat" %R% END)
(rx_exact <- exactly("cat"))
stringi::stri_detect_regex(x, rx_start)
stringi::stri_detect_regex(x, rx_end)
stringi::stri_detect_regex(x, rx_exact)
as.regex gives objects the class "regex".
is.regex tests for objects of class "regex".
as.regex(x)
is.regex(x)
x | An object to test or convert. |
as.regex returns the input object, with class c("regex", "character").
is.regex returns TRUE when the input inherits from class "regex" and FALSE otherwise.
x <- as.regex("month.abb")
is.regex(x)
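The return values described above can be checked directly. The following is a minimal sketch, assuming rebus.base is installed; the variable names plain and rx are illustrative only.

```r
library(rebus.base)  # assumption: the package is installed

plain <- "[0-9]+"      # an ordinary character string
rx <- as.regex(plain)  # same text, now carrying the "regex" class

class(plain)
class(rx)
is.regex(plain)
is.regex(rx)

# The underlying text is unchanged, so base R functions still accept it
matched <- grepl(rx, c("abc", "a1c"))
matched
```

Because the object still inherits from character, it can be passed anywhere a pattern string is expected.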
Backreferences for replacement operations. These are used by replacement functions such as sub and stri_replace_first_regex, and by the stringi and stringr match functions such as stri_match_first_regex.
REF1 REF2 REF3 REF4 REF5 REF6 REF7 REF8 REF9
ICU_REF1 ICU_REF2 ICU_REF3 ICU_REF4 ICU_REF5 ICU_REF6 ICU_REF7 ICU_REF8 ICU_REF9
An object of class regex (inherits from character) of length 1.
http://www.regular-expressions.info/backref.html and http://www.rexegg.com/regex-capture.html
capture, for creating capture groups that can be referred to.
# For R's PCRE and Perl engines
REF1
REF2
# and so on, up to REF9

# For stringi/stringr's ICU engine
ICU_REF1
ICU_REF2
# and so on, up to ICU_REF9

# Usage
sub("a(b)c(d)", REF1 %R% REF2, "abcd")
stringi::stri_replace_first_regex("abcd", "a(b)c(d)", ICU_REF1 %R% ICU_REF2)
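As a further illustration of backreferences in a replacement, the sketch below swaps two words using base R's sub. It assumes rebus.base is installed; the names word, pattern, and swapped are hypothetical.

```r
library(rebus.base)  # assumption: the package is installed

# Two capture groups, each matching a run of word characters
word <- capture(wrd(1, Inf))
pattern <- word %R% " " %R% word

# Refer to the groups in reverse order in the replacement
swapped <- sub(pattern, REF2 %R% " " %R% REF1, "hello world", perl = TRUE)
swapped
```

With stringi, the ICU_REF constants would take the place of REF1 and REF2.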
Create a token to capture or not.
capture(x)
group(x)
token(x)
engroup(x, capture)
x | A character vector. |
capture | Logical If |
A character vector representing part or all of a regular expression.
http://www.regular-expressions.info/brackets.html
or, for more examples
x <- "foo"
capture(x)
group(x)

# Usage
# capture is good with match functions
(rx_price <- capture(digit(1, Inf) %R% DOT %R% digit(2)))
(rx_quantity <- capture(digit(1, Inf)))
(rx_all <- DOLLAR %R% rx_price %R% " for " %R% rx_quantity)
stringi::stri_match_first_regex("The price was $123.99 for 12.", rx_all)

# group is mostly used with alternation. See ?or.
(rx_spread <- group("peanut butter" %|% "jam" %|% "marmalade"))
stringi::stri_extract_all_regex(
  "You can have peanut butter, jam, or marmalade on your toast.",
  rx_spread
)
Group characters together in a class to match any of them (char_class) or none of them (negated_char_class).
char_class(...)
negated_char_class(...)
negate_and_group(...)
... | Character vectors. |
A character vector representing part or all of a regular expression.
http://www.regular-expressions.info/charclass.html
char_class(LOWER, "._")
negated_char_class(LOWER, "._")

# Usage
x <- (1:10) ^ 2
(rx_odd <- char_class(1, 3, 5, 7, 9))
(rx_not_odd <- negated_char_class(1, 3, 5, 7, 9))
stringi::stri_detect_regex(x, rx_odd)
stringi::stri_detect_regex(x, rx_not_odd)
Match a class of values. These are typically used in combination with char_class to create new character classes.
ALPHA ALNUM BLANK CNTRL DIGIT GRAPH LOWER PRINT PUNCT SPACE UPPER HEX_DIGIT ANY_CHAR GRAPHEME NEWLINE DGT WRD SPC NOT_DGT NOT_WRD NOT_SPC ASCII_DIGIT ASCII_LOWER ASCII_UPPER ASCII_ALPHA ASCII_ALNUM UNMATCHABLE
An object of class regex (inherits from character) of length 1.
ClassGroups for the functional form, SpecialCharacters for regex metacharacters, Anchors for constants to match the start/end of a string, WordBoundaries for constants to match the start/end of a word.
# R character classes
ALNUM
ALPHA
BLANK
CNTRL
DIGIT
GRAPH
LOWER
PRINT
PUNCT
SPACE
UPPER
HEX_DIGIT

# Special chars
ANY_CHAR
GRAPHEME
NEWLINE

# Generic classes
DGT
WRD
SPC

# Generic negated classes
NOT_DGT
NOT_WRD
NOT_SPC

# Non-locale-specific classes
ASCII_DIGIT
ASCII_LOWER
ASCII_UPPER
ASCII_ALPHA
ASCII_ALNUM

# An oxymoron
UNMATCHABLE

# Usage
x <- c("a1 A", "a1 a")
rx <- LOWER %R% DIGIT %R% SPACE %R% UPPER
stringi::stri_detect_regex(x, rx)
Match character classes.
alnum(lo, hi, char_class = TRUE)
alpha(lo, hi, char_class = TRUE)
blank(lo, hi, char_class = TRUE)
cntrl(lo, hi, char_class = TRUE)
digit(lo, hi, char_class = TRUE)
graph(lo, hi, char_class = TRUE)
lower(lo, hi, char_class = TRUE)
printable(lo, hi, char_class = TRUE)
punct(lo, hi, char_class = TRUE)
space(lo, hi, char_class = TRUE)
upper(lo, hi, char_class = TRUE)
hex_digit(lo, hi, char_class = TRUE)
any_char(lo, hi)
grapheme(lo, hi)
newline(lo, hi)
dgt(lo, hi, char_class = FALSE)
wrd(lo, hi, char_class = FALSE)
spc(lo, hi, char_class = FALSE)
not_dgt(lo, hi, char_class = FALSE)
not_wrd(lo, hi, char_class = FALSE)
not_spc(lo, hi, char_class = FALSE)
ascii_digit(lo, hi, char_class = TRUE)
ascii_lower(lo, hi, char_class = TRUE)
ascii_upper(lo, hi, char_class = TRUE)
ascii_alpha(lo, hi, char_class = TRUE)
ascii_alnum(lo, hi, char_class = TRUE)
char_range(lo, hi, char_class = lo < hi)
lo | A non-negative integer. Minimum number of repeats, when grouped. |
hi | A positive integer. Maximum number of repeats, when grouped. |
char_class | A logical value. Should |
A character vector representing part or all of a regular expression.
R has many built-in locale-dependent character classes, like [:alnum:] (representing alphanumeric characters, that is lower or upper case letters or numbers). Some of these behave in unexpected ways when using the ICU engine (that is, when using stringi or stringr). See the punctuation example. For these engines, using Unicode properties (UnicodeProperty) may give you a more reliable match.
There are also some generic character classes like \w (representing lower or upper case letters or numbers or underscores). Since version 0.0-3, these use the default char_class = FALSE, since they already act as character classes.
Finally, there are ASCII-only ways of specifying letters like a-zA-Z. Which version you want depends upon how you want to deal with international characters, and the vagaries of the underlying regular expression engine. I suggest reading the regex help page and doing lots of testing.
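To make the three families concrete, the sketch below builds the same "two or three digits" pattern from a locale-dependent class, a generic class, and an ASCII-only class, then tests each against plain ASCII input. It assumes rebus.base is installed; the variable names are illustrative, and results for non-ASCII digits may differ between engines and locales.

```r
library(rebus.base)  # assumption: the package is installed

x <- c("7", "42", "1234", "4a")

rx_posix <- exactly(digit(2, 3))        # locale-dependent class
rx_generic <- exactly(dgt(2, 3))        # generic shorthand class
rx_ascii <- exactly(ascii_digit(2, 3))  # ASCII-only range

# Print the generated patterns to compare how each family is spelled
rx_posix
rx_generic
rx_ascii

res_posix <- grepl(rx_posix, x, perl = TRUE)
res_generic <- grepl(rx_generic, x, perl = TRUE)
res_ascii <- grepl(rx_ascii, x, perl = TRUE)
rbind(res_posix, res_generic, res_ascii)
```

For ASCII input like this the three agree; the differences only appear with international characters, which is where testing against your chosen engine matters.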
http://www.regular-expressions.info/shorthand.html and http://www.rexegg.com/regex-quickstart.html#posix
# R character classes
alnum()
alpha()
blank()
cntrl()
digit()
graph()
lower()
printable()
punct()
space()
upper()
hex_digit()

# Special chars
any_char()
grapheme()
newline()

# Generic classes
dgt()
wrd()
spc()

# Generic negated classes
not_dgt()
not_wrd()
not_spc()

# Non-locale-specific classes
ascii_digit()
ascii_lower()
ascii_upper()

# Don't provide a class wrapper
digit(char_class = FALSE) # same as DIGIT

# Match repeated values
digit(3)
digit(3, 5)
digit(0)
digit(1)
digit(0, 1)

# Ranges of characters
char_range(0, 7) # octal number

# Usage
(rx <- digit(3))
stringi::stri_detect_regex(c("123", "one23"), rx)

# Some classes behave differently under different engines
# In particular PCRE and Perl recognise all these characters
# as punctuation but ICU does not
p <- c(
  "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "[", "]", "{", "}",
  ";", ":", "'", '"', ",", "<", ">", ".", "/", "?", "\\", "|", "`", "~"
)
icu_matched <- stringi::stri_detect_regex(p, punct())
p[icu_matched]
p[!icu_matched]
pcre_matched <- grepl(punct(), p)
p[pcre_matched]
p[!pcre_matched]

# A grapheme is a character that can be defined by more than one code point
# PCRE does not recognise the concept.
x <- c("Chloe", "Chlo\u00e9", "Chlo\u0065\u0301")
stringi::stri_match_first_regex(x, "Chlo" %R% capture(grapheme()))

# newline() matches three types of line ending: \r, \n, \r\n.
# You can standardize line endings using
stringi::stri_replace_all_regex("foo\nbar\r\nbaz\rquux", NEWLINE, "\n")
Operator equivalent of regex.
x %c% y
x %R% y
x | A character vector. |
y | A character vector. |
A character vector representing part or all of a regular expression.
%c% was the original operator for this ('c' for 'concatenate'). This is hard work to type on a QWERTY keyboard though, so it has been replaced with %R%.
# Notice the recycling
letters %R% month.abb
Prefix the special characters with a backslash to make them literal characters.
escape_special(x, escape_brace = TRUE)
x | A character vector. |
escape_brace | A logical value indicating if opening braces should be escaped. If using R's internal PCRE engine or |
A character vector, with regex meta-characters escaped.
Special characters inside character classes (except caret, hyphen, and closing bracket in certain positions) do not need to be escaped. This function makes no attempt to parse your regular expression and decide whether a special character is inside a character class. It simply escapes every special character it finds.
escape_special("\\ ^ $ . | ? * + ( ) { } [ ]")
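The effect of escaping is easiest to see by matching with and without it. This is a minimal sketch, assuming rebus.base is installed; the names raw, escaped, and candidates are hypothetical.

```r
library(rebus.base)  # assumption: the package is installed

raw <- "1.5"
escaped <- escape_special(raw)
escaped

candidates <- c("1.5", "125")

# Unescaped, the dot matches any character, so both strings match
unescaped_hits <- grepl(raw, candidates)

# Escaped, the dot only matches a literal dot
escaped_hits <- grepl(escaped, candidates)

unescaped_hits
escaped_hits
```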
Prints/formats objects of class regex.
## S3 method for class 'regex'
format(x, ...)

## S3 method for class 'regex'
print(x, encode_string = FALSE, ...)
x | A regex object. |
... | Passed from other format methods. Currently ignored. |
encode_string | If |
format.regex returns a character vector. print.regex is invoked for the side effect of printing the regex object.
group(1:5)
lookahead(1:5)
Treats its contents as literal characters. Equivalent to using fixed = TRUE, but for part of the pattern rather than all of it.
literal(x)
x | A character vector. |
A character vector representing part or all of a regular expression.
(rx <- digit(1, 3))
(rx_literal <- literal(rx))

# Usage
stringi::stri_detect_regex("123", rx)
stringi::stri_detect_regex("123", rx_literal)
stringi::stri_detect_regex("[[:digit:]]{1,3}", rx_literal)
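The "part of the pattern" point can be sketched by combining a literal piece with an ordinary regex piece. This assumes rebus.base is installed and uses base R's grepl with perl = TRUE; the names rx_mixed and hits are illustrative.

```r
library(rebus.base)  # assumption: the package is installed

# "a.c" is taken literally, while the trailing digit is a normal regex
rx_mixed <- literal("a.c") %R% digit(1)
rx_mixed

hits <- grepl(rx_mixed, c("a.c1", "abc1", "a.cx"), perl = TRUE)
hits
```

Only the first string matches: the second fails because the dot is literal, and the third fails because the regex part still requires a digit.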
Zero-length matching. That is, the characters are checked when detecting a match, but they are not included in the matched or extracted text.
lookahead(x)
negative_lookahead(x)
lookbehind(x)
negative_lookbehind(x)
x | A character vector. |
A character vector representing part or all of a regular expression.
Lookbehind is not supported by R's PCRE engine. Use R's Perl engine or stringi/stringr's ICU engine.
http://www.regular-expressions.info/lookaround.html and http://www.rexegg.com/regex-lookarounds.html
x <- "foo"
lookahead(x)
negative_lookahead(x)
lookbehind(x)
negative_lookbehind(x)

# Usage
x <- c("mozambique", "qatar", "iraq")
# q followed by a character that isn't u
(rx_neg_class <- "q" %R% negated_char_class("u"))
# q not followed by a u
(rx_neg_lookahead <- "q" %R% negative_lookahead("u"))
stringi::stri_detect_regex(x, rx_neg_class)
stringi::stri_detect_regex(x, rx_neg_lookahead)
stringi::stri_extract_first_regex(x, rx_neg_class)
stringi::stri_extract_first_regex(x, rx_neg_lookahead)

# PCRE engine doesn't support lookbehind
x2 <- c("queen", "vacuum")
(rx_lookbehind <- lookbehind("q") %R% "u")
stringi::stri_detect_regex(x2, rx_lookbehind)
try(grepl(rx_lookbehind, x2))
grepl(rx_lookbehind, x2, perl = TRUE)
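The zero-length behaviour shows up most clearly when extracting. The sketch below, which assumes rebus.base is installed and uses base R with perl = TRUE, compares a pattern that consumes a suffix with one that only looks ahead for it. The input string and variable names are made up for illustration.

```r
library(rebus.base)  # assumption: the package is installed

x <- "price: 42USD"

rx_consume <- digit(1, Inf) %R% "USD"               # suffix is part of the match
rx_lookahead <- digit(1, Inf) %R% lookahead("USD")  # suffix is only checked

with_suffix <- regmatches(x, regexpr(rx_consume, x, perl = TRUE))
digits_only <- regmatches(x, regexpr(rx_lookahead, x, perl = TRUE))

with_suffix
digits_only
```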
Applies one or more mode modifiers to the regular expression.
modify_mode(x, modes = c("i", "x", "s", "m", "J", "X"))
case_insensitive(x)
free_spacing(x)
single_line(x)
multi_line(x)
duplicate_group_names(x)
no_backslash_escaping(x)
x | A character vector. |
modes | A character vector of mode modifiers. |
A character vector representing part or all of a regular expression.
http://www.regular-expressions.info/modifiers.html and http://www.rexegg.com/regex-modifiers.html
x <- "foo"
case_insensitive(x)
free_spacing(x)
single_line(x)
multi_line(x)
duplicate_group_names(x)
no_backslash_escaping(x)
modify_mode(x, c("i", "J", "X"))
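As a small usage sketch of one modifier, the following compares a plain pattern with its case-insensitive version. It assumes rebus.base is installed and uses base R's grepl with perl = TRUE; the test strings are illustrative.

```r
library(rebus.base)  # assumption: the package is installed

x <- c("FOO", "Food", "bar")

rx_plain <- "foo"
rx_nocase <- case_insensitive("foo")
rx_nocase

plain_hits <- grepl(rx_plain, x, perl = TRUE)
nocase_hits <- grepl(rx_nocase, x, perl = TRUE)

plain_hits
nocase_hits
```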
Match one string or another.
or(..., capture = FALSE)
x %|% y
or1(x, capture = FALSE)
... | Character vectors. |
capture | A logical value indicating whether or not the result should be captured. See note. |
x | A character vector. |
y | A character vector. |
A character vector representing part or all of a regular expression.
or takes multiple character vector inputs and returns a character vector of the inputs separated by pipes. %|% is an operator interface to this function. or1 takes a single character vector and returns a string collapsed by pipes.
When capture is TRUE, the values are wrapped in a capture group (see capture). When capture is FALSE (the default for or and or1), the values are wrapped in a non-capture group (see token). When capture is NA (the case for %|%), the values are not wrapped in anything.
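One way to observe the three capture settings is to count the columns that a match function returns: one for the whole match, plus one per capture group. The sketch below assumes rebus.base and stringi are installed; the helper n_cols and the input "hotdog" are hypothetical.

```r
library(rebus.base)  # assumption: the package is installed

animals <- c("dog", "cat")

# Hypothetical helper: number of columns in the match matrix
n_cols <- function(rx) {
  ncol(stringi::stri_match_first_regex("hotdog", rx))
}

cols_capture <- n_cols(or1(animals, capture = TRUE))   # capture group
cols_group <- n_cols(or1(animals, capture = FALSE))    # non-capture group
cols_bare <- n_cols(or1(animals, capture = NA))        # no wrapping
cols_pipe <- n_cols("dog" %|% "cat")                   # operator form

c(cols_capture, cols_group, cols_bare, cols_pipe)
```

Only the capture = TRUE form adds a second column; the others return the whole match alone.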
http://www.regular-expressions.info/alternation.html
# or takes an arbitrary number of arguments and groups them without capture
# Notice the recycling of inputs
or(letters, month.abb, "foo")

# or1 takes a single character vector
or1(c(letters, month.abb, "foo")) # Not the same as before!

# Capture the group
or1(letters, capture = TRUE)

# Don't create a group
or1(letters, capture = NA)

# The pipe operator doesn't group
letters %|% month.abb %|% "foo"

# Usage
(rx <- or("dog", "cat", "hippopotamus"))
stringi::stri_detect_regex(c("boondoggle", "caterwaul", "water-horse"), rx)
Makes the regular expression (or part of it) recursive.
recursive(x)
x | A character vector. |
A character vector representing part or all of a regular expression.
Recursion is not supported by R's internal PCRE engine or stringi's ICU engine.
http://www.regular-expressions.info/recurse.html and http://www.rexegg.com/regex-recursion.html
recursive("a")

# Recursion isn't supported by R's PCRE engine or
# stringi/stringr's ICU engine
x <- c("ab222z", "ababz", "ab", "abab")
rx <- "ab(?R)?z"
grepl(rx, x, perl = TRUE)
try(grepl(rx, x))
try(stringi::stri_detect_regex(x, rx))
Creates a regex object.
regex(...)
... | Passed to |
An object of class regex.
This works like paste0, but the returned value has class c("regex", "character").
paste0
as.regex(month.abb)
regex(letters[1:5], "?")
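The relationship to paste0 can be checked by comparing the two results once the class is stripped. This is a minimal sketch, assuming rebus.base is installed; the names parts and same_text are illustrative.

```r
library(rebus.base)  # assumption: the package is installed

parts <- regex(letters[1:3], "?")
parts
class(parts)

# as.character() drops the class, leaving the same text that paste0 builds
same_text <- identical(as.character(parts), paste0(letters[1:3], "?"))
same_text
```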
Match repeated values.
repeated(x, lo, hi, lazy = FALSE, char_class = NA)
optional(x, char_class = NA)
lazy(x)
zero_or_more(x, char_class = NA)
one_or_more(x, char_class = NA)
x | A character vector. |
lo | A non-negative integer. Minimum number of repeats, when grouped. |
hi | A positive integer. Maximum number of repeats, when grouped. |
lazy | A logical value. Should repetition be matched lazily or greedily? |
char_class | A logical value. Should |
A character vector representing part or all of a regular expression.
http://www.regular-expressions.info/repeat.html and http://www.rexegg.com/regex-quantifiers.html
# Can match constants or class values
repeated(GRAPH, 2, 5)
repeated(graph(), 2, 5) # same

# Short cuts for special cases
optional(blank()) # same as repeated(blank(), 0, 1)
zero_or_more(hex_digit()) # same as repeated(hex_digit(), 0, Inf)
one_or_more(printable()) # same as repeated(printable(), 1, Inf)

# 'Lazy' matching (match smallest no. of chars)
repeated(cntrl(), 2, 5, lazy = TRUE)
lazy(one_or_more(cntrl()))

# Overriding character class wrapping
repeated(ANY_CHAR, 2, 5, char_class = FALSE)

# Usage
x <- "1234567890"
stringi::stri_extract_first_regex(x, one_or_more(DIGIT))
stringi::stri_extract_first_regex(x, repeated(DIGIT, lo = 3, hi = 6))
stringi::stri_extract_first_regex(x, lazy(repeated(DIGIT, lo = 3, hi = 6)))
col <- c("color", "colour")
stringi::stri_detect_regex(col, "colo" %R% optional("u") %R% "r")
Forces replacement values to be upper or lower case. Only supported by Perl regular expressions.
as_lower(x)
as_upper(x)
x | A character vector. |
A character vector representing part or all of a regular expression.
http://www.regular-expressions.info/replacecase.html
# Convert to title case using Perl regex
x <- "In caSE of DISASTER, PuLl tHe CoRd"
matching_rx <- capture(WRD) %R% capture(wrd(1, Inf))
replacement_rx <- as_upper(REF1) %R% as_lower(REF2)
gsub(matching_rx, replacement_rx, x, perl = TRUE)

# PCRE and ICU do not currently support this operation
# The next lines are intended to return gibberish
gsub(matching_rx, replacement_rx, x)
replacement_rx_icu <- as_upper(ICU_REF1) %R% as_lower(ICU_REF2)
stringi::stri_replace_all_regex(x, matching_rx, replacement_rx_icu)
Constants to match special characters.
BACKSLASH CARET DOLLAR DOT PIPE QUESTION STAR PLUS OPEN_PAREN CLOSE_PAREN OPEN_BRACKET CLOSE_BRACKET OPEN_BRACE
An object of class regex (inherits from character) of length 1.
http://www.regular-expressions.info/characters.html
escape_special for the functional form, CharacterClasses for regex metacharacters, Anchors for constants to match the start/end of a string, WordBoundaries for constants to match the start/end of a word.
BACKSLASH
CARET
DOLLAR
DOT
PIPE
QUESTION
STAR
PLUS
OPEN_PAREN
CLOSE_PAREN
OPEN_BRACKET
CLOSE_BRACKET
OPEN_BRACE

# Usage
x <- "\\^$."
rx <- BACKSLASH %R% CARET %R% DOLLAR %R% DOT
stringi::stri_detect_regex(x, rx)
# No escapes - these chars have special meaning inside regex
stringi::stri_detect_regex(x, x)

# Usually closing brackets can be matched without escaping
stringi::stri_detect_regex("]", "]")
# If you want to match a closing bracket inside a character class
# the closing bracket must be placed first
(rx <- char_class("]a"))
stringi::stri_detect_regex("]", rx)
# ICU and Perl also allow you to place the closing bracket in
# other positions if you escape it
(rx <- char_class("a", CLOSE_BRACKET))
stringi::stri_detect_regex("]", rx)
grepl(rx, "]", perl = TRUE)
# PCRE does not allow this
grepl(rx, "]")
BOUNDARY matches a word boundary.
whole_word wraps a regex in word boundary tokens to match a whole word.
BOUNDARY
NOT_BOUNDARY
whole_word(x)
x | A character vector. |
An object of class regex (inherits from character) of length 1.
A character vector representing part or all of a regular expression.
http://www.regular-expressions.info/wordboundaries.html and http://www.rexegg.com/regex-boundaries.html
BOUNDARY
NOT_BOUNDARY

# Usage
x <- c("the catfish miaowed", "the tomcat miaowed", "the cat miaowed")
(rx_before <- BOUNDARY %R% "cat")
(rx_after <- "cat" %R% BOUNDARY)
(rx_whole_word <- whole_word("cat"))
stringi::stri_detect_regex(x, rx_before)
stringi::stri_detect_regex(x, rx_after)
stringi::stri_detect_regex(x, rx_whole_word)
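Since whole_word is described as wrapping a regex in word boundary tokens, it should behave like a pattern assembled by hand from BOUNDARY. The sketch below checks that the two give the same matches, using base R's grepl with perl = TRUE. It assumes rebus.base is installed; the variable names are illustrative.

```r
library(rebus.base)  # assumption: the package is installed

x <- c("the catfish miaowed", "the tomcat miaowed", "the cat miaowed")

rx_manual <- BOUNDARY %R% "cat" %R% BOUNDARY  # boundaries added by hand
rx_helper <- whole_word("cat")                # boundaries added by the helper

manual_hits <- grepl(rx_manual, x, perl = TRUE)
helper_hits <- grepl(rx_helper, x, perl = TRUE)

manual_hits
helper_hits
```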