Regular Expressions for Parsing CSS
I know, I know. Right off the bat, developers will ask you why wouldn’t you use an existing CSS parser and integrate it into your projects? If you are creating a more limited use for user inputted CSS, regular expressions will provide what is needed.
What are regular expressions, you ask? According to regular-expressions.info: “A regular expression … is a special text string for describing a search pattern.” You can search for letters, numbers, symbols and whitespace characters, among other things. From there, you can match these patterns and perform an action based on the match. I used regular expressions in a plugin I’m working on for validation of CSS from a form input. That is what we will explore here.
A Little Background/Example of Use
I initially used the style rule regex inside of a validation callback for a WordPress Customizer textarea input. Then, the same regex was used in a JavaScript on the frontend. Before I just drop a bunch of regexes, I will show one that was used in my plugin. More information on the Customizer is available at https://developer.wordpress.org/themes/advanced-topics/customizer-api/.
$wp_customize->add_setting( 'jbe_ch_displaybox_marginright', array( 'type' => 'theme_mod', ... 'sanitize_callback' => 'jbe_ch_sanitize_length', 'sanitize_js_callback' => 'jbe_ch_sanitize_length', ) ); $wp_customize->add_control( 'jbe_ch_displaybox_marginright', array( 'type' => 'textbox', ... ) );
The sanitize_callback and the sanitize_js_callback call the filter that uses the regex. The pseudocode below shows how it is used:
function jbe_ch_sanitize_length( $input ) { // This regular expression checks for a number, followed by the strings px or em or % // and does not allow anything other than those strings. It supports up to four digits if( regular expression in the form of a preg match goes here..., $input ) { return $input; } else { return; } }
In my actual code, the preg_match php regular expression is negated. So, in that case, we would return nothing first, then in the else statement, return $input.
An Explanation
Take the following example for the text @import url(http://some-website.com);
:
/^@import.*$/mi
^ and $ are anchors. Anchors match a location in relation to a string. The ^ symbol looks for a match that begins with @import. The $ matches the end of the string.
The . (dot) is a metacharacter that represents any single character except newline (\n). The * (asterisk) metacharacter repeats the previous token (explained later) 0 to an infinite number of times.
The m and i at the end are modifiers. Modifiers modify how the regex matches. m means the anchors ^ and $ should match the beginning and end of a whole line of text, rather than just a string within that line. i means case insensitive. Others include g for a global match. This is important when you have more than one instance of a pattern that needs to be matched. s makes the dot character match everything, including \n newline characters.
*the g modifier for global matches is not supported in php. Instead, use preg_match_all to match multiples of the same pattern. Also, the s modifier is not supported in JavaScript. Using [\s\S] (explained later) accomplishes the same task.
A token can represent a character group, a capturing group or a number of metacharacters. Capturing groups and character groups will be explained when we get into more complicated regular expressions below.
The Regexes I used
Px, Em or % values:
/[0-9]{1,4}(px|em|%)/i
In the regex above, there are character and capture groups.
In a character group, you can specify any number of characters, including underscores, dashes and periods. Special characters used within must be escaped with a \ backslash. See the section Special Characters at Regular-Expressions.info.
A capture group allows you to apply repetitions, for instance, to a part of the regular expression. A repetition would be represented by a * for 0 or more times, + for 1 or more times, ? for 0 or 1 time, or a range such as {1,4} for one to four times. Or, as in the above example, you can use it choose between several alternatives (separated by the | symbol).
I used the following in preg_replace statements to strip strings out and replace them with empty strings.
A line with an @import statement:
/^.*@import.*$/mi
A line with the word “expression”
/expression\(.*\)/i
A line with the string -moz-binding or -webkit-binding:
/(-moz-binding:|-webkit-binding:|binding:)\s.*/i
Take the CSS string background: url( javascript:alert( '1' ) );
This regex highlights the string javascript:something() and leaves the enclosing parentheses intact. preg_replace is used to replace the string with empty text.
/javascript:[a-z0-9\(\'\"\s][^)]*\)/i
Finally, here is a regex that formats a basic style rule:
/^(([a-z0-9\[\]=:]+\s?)|((div|span)?(#|\.){1}[a-z0-9\-_\s?:]+\s?)+)(\{[\s\S][^}]*})$/mi
This looks very complicated, huh? I will explain step by step what this regex is doing.
First, we have our beginning and ending anchors that say all of our style rules must begin and end with this pattern, nothing before or after.
We have a total of six capture groups. Let’s start with the second one.
([a-z0-9\[\]=:]+\s)
In the second capture group, we have a character group that looks for any single character a-z , the numbers 0-9, an opening and closing square bracket [ ] (the backslashes in front escape these, as they are special characters), an equal sign and a colon. These are all symbols that could appear in a CSS selector. The + sign says these should be repeated 1 or infinite times.
Following that is the or symbol |.
((div|span)?(#|\.){1}[a-z0-9\-_\s?:]+\s?))
Or, in the third capture group, we account for style rules that begin with classes or ids. The fourth capture group, embedded in the third, chooses between div or span and only allows those strings to precede our class/id selector. That is optional (?) and may show up once or not at all. Next, in the fifth capture group, we choose between # or \.. {1} means it is required only once. That is followed by letters, numbers, dashes, underscores and a colon that may be used in a class or id title. A \s that looks for any whitespace is thrown in for strings like .silver a:hover
. + once again repeats this one or infinite times. This is followed by whitespace that may be after the selector title.
The first capture group encloses the other four, to match everything within in order.
(\{[\s\S][^}]*})
The last capture group is for everything that follows the selector. It looks for the literal opening curly brace {. It must be escaped, as it is a regex special character. The character class with \s\S looks for whitespace and non-whitespace characters; literally anything. But if you have several style rules, this will select everything down to the closing curly brace of the last style rule. That’s not what we intend when we want to select each style rule. So, we use a negated character class [^}] to stop searching when the regex engine reaches the first closing curly brace }. The asterisk * matches the character class 0 or infinite times. The final curly brace matches the real curly brace on the end. I used multiline and case insensitive modifiers on the end. If you want to match multiple style rules in javascript, you must use the g modifier as well.
Conclusion
The usual disclaimers apply here. I learned how to create these as a result of Googling, trial and error, and lurking stack overflow posts. Needless to say, if you want to use these in real, public facing projects, use them at your own risk. Simply, I’m just expanding on previous experimentation and am also just learning as I am going along. Use these, but test them to see if they work first.
Resources
- PHP: preg_match
- PHP: preg_match_all
- JavaScript: .replace
- JavaScript: .match
- Regular-Expression.info
This regular expression testing site helped me greatly: