Friday, January 20, 2017
Legato Developers Corner #18: Regular Expressions and Legato
In this blog post we will discuss regular expressions and how you can employ them in your Legato scripts to add robust field validation as well as patterned string replacements. Regular expressions are a powerful tool available in many languages, so the concepts here also apply outside of Legato. There are a few different “flavors” of regular expressions; Legato uses the ECMAScript flavor, which is very similar to Perl's. If you already have a good understanding of regular expressions, you may wish to skip to the “Using Regular Expressions in Legato” section.
A Quick Introduction to Regular Expressions
For those who have not used regular expressions before, the concept is straightforward. You create a pattern using the regular expression syntax and then compare data to that pattern. A common parallel in everyday computing is using wildcards to find files on your computer. When you try to open a Word document in Office, Word limits the directory listing to files that end in .doc or .docx. Regular expressions can do simple matching like that and much more.
Basic Syntax and Examples
There are many resources for regular expressions on the Internet, so we are going to cover just some of the basics. For a more in-depth look at the available options, the Wikipedia article on regular expressions is a good place to start.
The most basic pattern is just text. Any characters that do not have a special meaning are matched as-is. For example, the pattern “cat” matches only the word “cat”. If we wanted it to match many pets, we could use the or operator ‘|’ like this: “cat|dog|bird|turtle”. This pattern would match any of the words listed. It is important to note that the character ‘.’ matches any character, so if you want to match a literal period (or any other special character), it needs to be escaped with a backslash.
To add more power we can use quantifiers. Quantifiers say how often the preceding item is allowed to occur. For example, if we wanted the above example to allow plural words we could make it “(cat|dog|bird|turtle)s?”. The parentheses group the names together and the question mark means match the preceding item ‘s’ once or not at all. The quantifiers are: ?, *, +, {#}, {#,} and {#,#} where # is a number. ‘?’ is zero or one, ‘*’ is zero or more, ‘+’ is one or more, “{2}” means exactly 2 times, “{2,}” means 2 or more, and “{2,5}” means 2 to 5 times.
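The alternation, escaping, and quantifier rules above can be tried in any ECMAScript regex engine. The sketch below uses TypeScript rather than Legato, and matchesWhole is a hypothetical helper (not a Legato function) that anchors the pattern so the entire input must match.

```typescript
// Hypothetical helper (not part of Legato): true only when the whole
// input matches the pattern, which is passed as a plain string.
function matchesWhole(text: string, pattern: string): boolean {
  return new RegExp("^(?:" + pattern + ")$").test(text);
}

const pets = "(cat|dog|bird|turtle)s?";
console.log(matchesWhole("cat", pets));     // true
console.log(matchesWhole("turtles", pets)); // true: the trailing "s" is optional
console.log(matchesWhole("catss", pets));   // false: "s?" allows at most one "s"
console.log(matchesWhole("fish", pets));    // false: not one of the alternatives

// Quantifiers applied to the single character "a".
console.log(matchesWhole("", "a*"));           // true: zero or more
console.log(matchesWhole("", "a+"));           // false: one or more
console.log(matchesWhole("aa", "a{2}"));       // true: exactly two
console.log(matchesWhole("aaa", "a{2,}"));     // true: two or more
console.log(matchesWhole("aaaaaa", "a{2,5}")); // false: six is more than five

// An unescaped "." matches any character; "\\." matches only a period.
console.log(matchesWhole("cat", "c.t"));   // true
console.log(matchesWhole("cat", "c\\.t")); // false
console.log(matchesWhole("c.t", "c\\.t")); // true
```

The doubled backslash in "c\\.t" is needed because the pattern is written as a string literal.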
The next basic pattern concept is sets. A set matches any single character in the collection. Sets are denoted by the “[ ]” characters. For example, the pattern “[abc]” means any character that is ‘a’, ‘b’, or ‘c’. Sets can also have ranges like “[A-Z]” (any character between ‘A’ and ‘Z’) and be inverted, as in “[^abc]” (any character that is not ‘a’, ‘b’, or ‘c’).
Another useful feature is character classes. Character classes are shorthand characters that mean specific things. For example, “\d” means any digit. This is equivalent to the set “[0-9]” but is shorter. Likewise, “\w” means any word character, which by default translates into “[A-Za-z0-9_]”. This means:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_
The available character classes depend on the particular regular expression engine, but classes for digits, word characters, and whitespace are generally available.
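As a quick check of the set and character class rules, here is a small TypeScript sketch (TypeScript shares the ECMAScript flavor; the matchesWhole helper is hypothetical and not part of Legato). It confirms that “\d” and “[0-9]” accept the same characters for a handful of sample inputs.

```typescript
// Hypothetical helper (not part of Legato): whole-input match.
function matchesWhole(text: string, pattern: string): boolean {
  return new RegExp("^(?:" + pattern + ")$").test(text);
}

console.log(matchesWhole("b", "[abc]"));  // true
console.log(matchesWhole("d", "[abc]"));  // false
console.log(matchesWhole("Q", "[A-Z]"));  // true
console.log(matchesWhole("q", "[A-Z]"));  // false: ranges are case-sensitive
console.log(matchesWhole("z", "[^abc]")); // true: inverted set
console.log(matchesWhole("a", "[^abc]")); // false

// "\\d" and "[0-9]" accept exactly the same single characters here.
const samples = ["0", "7", "9", "a", "_", " "];
const sameResults = samples.every(
  (s) => matchesWhole(s, "\\d") === matchesWhole(s, "[0-9]")
);
console.log(sameResults); // true

// "\\w" covers letters, digits, and the underscore, but not the hyphen.
console.log(matchesWhole("file_01", "\\w+")); // true
console.log(matchesWhole("file-01", "\\w+")); // false
```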
So bringing all this together we can make some patterns:
Pattern | Description
[0-9]{10} | Matches a CIK (exactly ten digits)
(-|\+)?[0-9]+(\.[0-9]+)? | Matches any decimal number (with optional -/+)
([^A-Z ])([A-Z]) | Matches a capital letter together with the character before it, when that character is not a capital letter or a space
(a|e|i|o|u|A|E|I|O|U)[^ ]* | Matches any word that begins with a vowel
\S+@(\S+\.)+[A-Za-z]{2,} | Loosely matches an email address
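To see these patterns in action, the following TypeScript sketch runs each one against a matching and a non-matching input. The matchesWhole helper and the sample inputs are made up for illustration. The capital-letter pattern is shown in a replace operation that inserts spaces, which is one plausible use for its two groups rather than a use stated above.

```typescript
// Hypothetical helper (not part of Legato): whole-input match.
function matchesWhole(text: string, pattern: string): boolean {
  return new RegExp("^(?:" + pattern + ")$").test(text);
}

const CIK = "[0-9]{10}";
const DECIMAL = "(-|\\+)?[0-9]+(\\.[0-9]+)?";
const VOWEL_WORD = "(a|e|i|o|u|A|E|I|O|U)[^ ]*";
const EMAIL = "\\S+@(\\S+\\.)+[A-Za-z]{2,}";

console.log(matchesWhole("0001234567", CIK));         // true
console.log(matchesWhole("12345", CIK));              // false: too short
console.log(matchesWhole("-3.14", DECIMAL));          // true
console.log(matchesWhole("3.", DECIMAL));             // false: digits must follow the period
console.log(matchesWhole("Apple", VOWEL_WORD));       // true
console.log(matchesWhole("banana", VOWEL_WORD));      // false
console.log(matchesWhole("user@example.com", EMAIL)); // true
console.log(matchesWhole("user@example", EMAIL));     // false: no period after the @

// The capital-letter pattern used in a replace: "$1 $2" puts a space
// between the two captured characters.
const spaced = "camelCaseName".replace(
  new RegExp("([^A-Z ])([A-Z])", "g"),
  "$1 $2"
);
console.log(spaced); // camel Case Name
```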
The last regular expression topic to discuss is backreferences. A backreference is a way to reuse text that was matched earlier in the pattern. Say we want to create a pattern to detect if the same word is written twice. We can do that easily with a backreference. The syntax for backreferences is a backslash followed by the group number within the pattern. The numbering gets a little complicated when there are nested groups, but the concept is still the same. If we wanted to create a pattern for the duplicate word example above, it would be “([^ ]+) \1”. This pattern is a group of one or more non-space characters followed by a space. The backreference then requires the same characters that matched group 1 to appear again.
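Here is a small TypeScript sketch of the duplicate-word pattern, used as a search inside a longer sentence. It also shows a caveat: without word boundaries, the tail of one word can pair with the next word. The “\b” variant is an addition for illustration and not part of the pattern described above.

```typescript
// The duplicate-word pattern, searched for anywhere in the input.
const duplicate = new RegExp("([^ ]+) \\1");

const hit = duplicate.exec("Paris in the the spring");
console.log(hit ? hit[0] : null); // the the

// Caveat: the "is" at the end of "this" pairs with the following "is".
console.log(duplicate.test("this is fine")); // true (a false positive)

// Assumption: \b word boundaries (not used in the text) tighten the match.
const bounded = new RegExp("\\b([^ ]+) \\1\\b");
console.log(bounded.test("this is fine"));            // false
console.log(bounded.test("Paris in the the spring")); // true
```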
Using Regular Expressions in Legato
Legato has two regular expression functions:
boolean = IsRegexMatch ( string text, string pattern );
and
string = ReplaceInStringRegex ( string text, string pattern, string replace );
The IsRegexMatch function tests a string against a pattern and returns true if it matches and false if it does not. This makes the function ideal for validating user input. The entire input string must match the pattern. If a partial match is needed the pattern should be changed to reflect that.
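The whole-string rule is easy to model in another ECMAScript engine. The TypeScript sketch below uses a hypothetical matchesWhole helper as a stand-in for the behavior described for IsRegexMatch; it is not Legato's implementation. It shows why a bare pattern fails on longer input and how wrapping the pattern in “.*” permits a partial match.

```typescript
// Hypothetical stand-in for the whole-string behavior described above.
function matchesWhole(text: string, pattern: string): boolean {
  return new RegExp("^(?:" + pattern + ")$").test(text);
}

// Validation: extra characters around a ten-digit value are rejected.
console.log(matchesWhole("0001234567", "[0-9]{10}"));    // true
console.log(matchesWhole("0001234567abc", "[0-9]{10}")); // false

// Partial matching: the pattern itself must allow the surrounding text.
console.log(matchesWhole("report_2017.txt", "2017"));     // false
console.log(matchesWhole("report_2017.txt", ".*2017.*")); // true
```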
The ReplaceInStringRegex function finds text that matches the pattern and replaces it with the string specified by the replace parameter. This function replaces all occurrences of text that match the pattern parameter. The pattern does not need to match the entire input. The replace parameter can also contain backreferences to groups matched by the pattern parameter. The syntax is slightly different than before: in the replace string, we use “$1” to represent a backreference to the first group instead of “\1”.
Here is an example of backreferences in a replace operation:
string pattern = "(^'|( )')([^']+)'";
AddMessage(ReplaceInStringRegex("I've poorly 'quoted things'", pattern, "$2\"$3\""));
Creates this output:
I've poorly "quoted things"
The above example replaced the single quotes surrounding the text with double quotes while ignoring the apostrophe in “I’ve”. The $2 refers to the second group (in this case the space in parentheses) and the $3 refers to the third group, which is the text that is quoted.
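The same replacement can be reproduced in TypeScript, whose replace method also uses “$1”-style references. This is a sketch in a different language, not Legato code. The second input is an extra case added here: when the quote is at the start of the string, group 2 does not take part in the match and JavaScript substitutes an empty string for $2. How Legato handles that case is not covered above.

```typescript
// Same pattern and replacement as the Legato example, with the global
// flag so that every occurrence is replaced.
const quotePattern = new RegExp("(^'|( )')([^']+)'", "g");
const replacement = "$2\"$3\"";

const fixedMiddle = "I've poorly 'quoted things'".replace(quotePattern, replacement);
console.log(fixedMiddle); // I've poorly "quoted things"

// Quote at the start of the input: group 2 is unmatched, so $2 is empty.
const fixedStart = "'Quoted' at the start".replace(quotePattern, replacement);
console.log(fixedStart); // "Quoted" at the start
```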
One important thing to note when using regular expressions in Legato is the expressions themselves may require extensive use of the backslash character. This character also needs to be escaped in Legato when using string literals. A simple pattern like “\w+\.txt” will need to be escaped:
AddMessage("%d", IsRegexMatch("cat.txt", "\\w+\\.txt"));
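The effect of forgetting the extra backslashes can be seen in TypeScript, which has the same kind of string-literal escaping. This sketch shows TypeScript's behavior only; how Legato treats an unrecognized escape such as “\w” in a string literal is not stated here, so treat it as an analogy. The matchesWhole helper is hypothetical.

```typescript
// Hypothetical helper (not part of Legato): whole-input match.
function matchesWhole(text: string, pattern: string): boolean {
  return new RegExp("^(?:" + pattern + ")$").test(text);
}

const underEscaped = "\w+\.txt"; // TypeScript drops these backslashes
const escaped = "\\w+\\.txt";    // each "\\" becomes one backslash

console.log(underEscaped); // w+.txt
console.log(escaped);      // \w+\.txt

console.log(matchesWhole("cat.txt", escaped));      // true
console.log(matchesWhole("cat.txt", underEscaped)); // false: now requires a literal "w"
console.log(matchesWhole("wwwxtxt", underEscaped)); // true: "." matches any character
```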
A Sample Function Using Regular Expressions
This sample function uses regular expressions to rename files within a directory. The function has many production uses, like renaming slides from PowerPoint, as well as personal applications, such as renaming camera pictures or music files.
// Renames all Files and Directories in path using pattern
int RenameInDirectory(string path, string pattern, string replace, boolean bProcessSub, boolean bRename) {
    int matched; // Matched items
    handle hFF;
    boolean hasfiles;
    string src, dst, name, nname;
    dword dw;
    int rc;

    hFF = GetFirstFile(AddPaths(path, "*"));
    if (IsError(hFF)) {
        return 0;
    }
    matched = 0;
    hasfiles = true;
    while (hasfiles) {
        name = GetName(hFF);
        dw = GetFileAttributeBits(hFF);
        if ((dw & FILE_ATTRIBUTE_DIRECTORY) != 0) {
            if (bProcessSub == TRUE && name != "." && name != "..") {
                name = AddPaths(path, name);
                ConsolePrint("Entering Directory %s\r\n", name);
                matched += RenameInDirectory(name, pattern, replace, bProcessSub, bRename);
            }
        }
        else {
            if (IsRegexMatch(name, pattern)) {
                nname = ReplaceInStringRegex(name, pattern, replace);
                ConsolePrint("Renaming File %s to %s\r\n", name, nname);
                src = AddPaths(path, name);
                dst = AddPaths(path, nname);
                if (bRename == true) {
                    rc = RenameFile(src, dst);
                    if (IsError(rc)){
                        ConsolePrint(" FAILED! 0x%08X\r\n", rc);
                    }
                    else {
                        matched++;
                    }
                }
                else {
                    matched++;
                }
            }
            else {
                ConsolePrint("Skipping File %s\r\n", name);
            }
        }
        rc = GetNextFile(hFF);
        if (IsError(rc)) {
            hasfiles = false;
        }
    }
    CloseHandle(hFF);
    return matched;
}
The RenameInDirectory function takes several parameters: the path in which to search, the pattern to match, the text with which the function will perform the replace operation, and then two booleans. One boolean indicates whether to process subdirectories, and the second indicates whether to rename files. This function returns the number of files that have been / would be renamed. It first defines the many necessary variables.
hFF = GetFirstFile(AddPaths(path, "*"));
if (IsError(hFF)) {
    return 0;
}
matched = 0;
hasfiles = true;
while (hasfiles) {
    name = GetName(hFF);
    dw = GetFileAttributeBits(hFF);
The process starts by using the GetFirstFile function to begin enumerating the files and directories in the supplied path; if that fails, the function returns 0. Inside a while loop that visits every entry, it gets the name of the entry using the GetName function and its attributes using the GetFileAttributeBits function.
if ((dw & FILE_ATTRIBUTE_DIRECTORY) != 0) {
    if (bProcessSub == TRUE && name != "." && name != "..") {
        name = AddPaths(path, name);
        ConsolePrint("Entering Directory %s\r\n", name);
        matched += RenameInDirectory(name, pattern, replace, bProcessSub, bRename);
    }
}
The if statement checks whether the current item is a directory. If it is, and the user wants to process subdirectories, and it is not the “.” or “..” entry, the directory is processed as well. The ConsolePrint function is used to tell the user what is going on during the rename process. The RenameInDirectory function is called again with the subdirectory as the path, and the number of matched items is increased by the number returned from the inner call. This makes our RenameInDirectory function recursive with a very clear ending point (a directory with no subdirectories).
else {
    if (IsRegexMatch(name, pattern)) {
        nname = ReplaceInStringRegex(name, pattern, replace);
        ConsolePrint("Renaming File %s to %s\r\n", name, nname);
        src = AddPaths(path, name);
        dst = AddPaths(path, nname);
If the item was not a directory, the regular expression magic happens. The function checks to see if the name of the file matches the pattern using the IsRegexMatch SDK function. This is important since it forces the pattern to match the entire filename. The script could do partial matching instead; we’ll discuss this more later. The ReplaceInStringRegex function is then called on the filename to create the new filename. The resulting names are printed to the console and then qualified paths are created.
if (bRename == true) {
    rc = RenameFile(src, dst);
    if (IsError(rc)){
        ConsolePrint(" FAILED! 0x%08X\r\n", rc);
    }
    else {
        matched++;
    }
}
else {
    matched++;
}
If the user wants to rename the files (not just test them), the function calls the RenameFile function. If the rename fails, it prints an error code to the console; otherwise it increments the count of renamed files. In test mode nothing is renamed, but the count is still incremented so the function can report how many files would be renamed.
    }
    else {
        ConsolePrint("Skipping File %s\r\n", name);
    }
}
This prints a message to the user if the file was skipped since it did not match the pattern.
    rc = GetNextFile(hFF);
    if (IsError(rc)) {
        hasfiles = false;
    }
}
CloseHandle(hFF);
return matched;
Finally, our script uses the GetNextFile function to get the next file in the directory. If there are no more files, the loop exits and the function returns the count of files that were renamed or would be renamed.
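The counting and recursion logic of the walkthrough can be sketched without touching a real file system. The TypeScript below models test mode only, using a hypothetical in-memory Folder type and a hypothetical countRenames function; neither is a Legato name. It is a simplified model of the control flow, not a port of the script.

```typescript
// Hypothetical in-memory stand-in for a directory; not a Legato type.
interface Folder {
  files: string[];
  subfolders: Folder[];
}

// Counts the files that would be renamed: a file counts when its whole
// name matches, and subfolders are visited only when processSub is true.
function countRenames(folder: Folder, pattern: string, processSub: boolean): number {
  const whole = new RegExp("^(?:" + pattern + ")$");
  let matched = 0;
  for (const file of folder.files) {
    if (whole.test(file)) {
      matched++;
    }
  }
  if (processSub) {
    for (const sub of folder.subfolders) {
      matched += countRenames(sub, pattern, processSub);
    }
  }
  return matched;
}

const tree: Folder = {
  files: ["slide1.jpg", "notes.txt"],
  subfolders: [{ files: ["slide2.jpg", "slide10.jpg"], subfolders: [] }],
};

console.log(countRenames(tree, "slide([0-9]+)\\.jpg", false)); // 1
console.log(countRenames(tree, "slide([0-9]+)\\.jpg", true));  // 3
```

The recursion ends naturally when a folder has no subfolders, just as in the Legato function.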
This relatively simple function uses the power of regular expressions to rename thousands of files in a predictable fashion. If we wanted to rename a bunch of jpgs from PowerPoint into GoFiler format (with leading 0s), we could just run:
RenameInDirectory("C:\\PPT", "slide([0-9])\\.jpg", "image_00$1.jpg", false, true);
RenameInDirectory("C:\\PPT", "slide([0-9]{2})\\.jpg", "image_0$1.jpg", false, true);
RenameInDirectory("C:\\PPT", "slide([0-9]{3})\\.jpg", "image_$1.jpg", false, true);
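To preview what those three pattern and replacement pairs produce, here is a TypeScript sketch that applies them to a few made-up file names. RenameRule and renameWith are hypothetical names. Unlike the three Legato calls, which make three passes over the directory, this sketch tries the rules in order for each name; the results are the same here because a renamed file no longer starts with “slide”.

```typescript
// Hypothetical pairing of a pattern with its replacement string.
interface RenameRule {
  pattern: string;
  replace: string;
}

// Returns the new name from the first rule whose pattern matches the
// whole name, or null when no rule applies (the "Skipping File" case).
function renameWith(name: string, rules: RenameRule[]): string | null {
  for (const rule of rules) {
    const whole = new RegExp("^(?:" + rule.pattern + ")$");
    if (whole.test(name)) {
      return name.replace(whole, rule.replace);
    }
  }
  return null;
}

const paddingRules: RenameRule[] = [
  { pattern: "slide([0-9])\\.jpg", replace: "image_00$1.jpg" },
  { pattern: "slide([0-9]{2})\\.jpg", replace: "image_0$1.jpg" },
  { pattern: "slide([0-9]{3})\\.jpg", replace: "image_$1.jpg" },
];

console.log(renameWith("slide7.jpg", paddingRules));   // image_007.jpg
console.log(renameWith("slide42.jpg", paddingRules));  // image_042.jpg
console.log(renameWith("slide123.jpg", paddingRules)); // image_123.jpg
console.log(renameWith("notes.txt", paddingRules));    // null
```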
Combining the Function with a User Interface
This next script combines the function from above with a user interface. This script CAN be dangerous to use as it will rename many files almost instantly. Use the “Test” button before executing the rename operation to see what the script will do. To use the script, save it as an .ls file and run it in GoFiler.
// Renames all Files and Directories in path using pattern
int RenameInDirectory(string path, string pattern, string replace, boolean bProcessSub, boolean bRename) {
    int matched; // Matched items
    handle hFF;
    boolean hasfiles;
    string src, dst, name, nname;
    dword dw;
    int rc;

    hFF = GetFirstFile(AddPaths(path, "*"));
    if (IsError(hFF)) {
        return 0;
    }
    matched = 0;
    hasfiles = true;
    while (hasfiles) {
        name = GetName(hFF);
        dw = GetFileAttributeBits(hFF);
        if ((dw & FILE_ATTRIBUTE_DIRECTORY) != 0) {
            if (bProcessSub == TRUE && name != "." && name != "..") {
                name = AddPaths(path, name);
                ConsolePrint("Entering Directory %s\r\n", name);
                matched += RenameInDirectory(name, pattern, replace, bProcessSub, bRename);
            }
        }
        else {
            if (IsRegexMatch(name, pattern)) {
                nname = ReplaceInStringRegex(name, pattern, replace);
                ConsolePrint("Renaming File %s to %s\r\n", name, nname);
                src = AddPaths(path, name);
                dst = AddPaths(path, nname);
                if (bRename == true) {
                    rc = RenameFile(src, dst);
                    if (IsError(rc)){
                        ConsolePrint(" FAILED! 0x%08X\r\n", rc);
                    }
                    else {
                        matched++;
                    }
                }
                else {
                    matched++;
                }
            }
            else {
                ConsolePrint("Skipping File %s\r\n", name);
            }
        }
        rc = GetNextFile(hFF);
        if (IsError(rc)) {
            hasfiles = false;
        }
    }
    CloseHandle(hFF);
    return matched;
}

//
// Supporting Dialog and Main Entry
// --------------------------------

#define DLG_LOCATION 201
#define DLG_PATTERN 202
#define DLG_REPLACE 203
#define DLG_BROWSE 101
#define DLG_TEST 102
#define DLG_RENAME 103
#define DLG_SUBDIR 301
#define DLG_EX1 501
#define DLG_EX2 502

int main() {
    DialogBox("RegexRename", "rr_");
    return 0;
}

int rr_run_function(boolean bRename) {
    string location;
    string pattern;
    string replace;
    boolean bSub;
    int rc;

    // Get Location
    location = EditGetText(DLG_LOCATION, "Location", EGT_FLAG_REQUIRED);
    if (location == "") {
        return ERROR_EOD;
    }
    if (IsPathQualified(location) != TRUE) {
        MessageBox('x', "Location must be a qualified path.");
        return ERROR_EOD;
    }

    // Get Pattern
    pattern = EditGetText(DLG_PATTERN, "Pattern", EGT_FLAG_REQUIRED);
    if (pattern == "") {
        return ERROR_EOD;
    }

    // Get Replace
    replace = EditGetText(DLG_REPLACE, "Replace", EGT_FLAG_REQUIRED);
    if (replace == "") {
        return ERROR_EOD;
    }

    // Get Subdirectories
    if (CheckboxGetState(DLG_SUBDIR) == BST_CHECKED) {
        bSub = TRUE;
    }
    else {
        bSub = FALSE;
    }

    ConsolePrint("\r\nBeginning Rename...\r\n");
    rc = RenameInDirectory(location, pattern, replace, bSub, bRename);
    if (bRename == TRUE) {
        MessageBox('i', "%d file(s) renamed.", rc);
    }
    else {
        MessageBox('i', "%d file(s) would be renamed.", rc);
    }
    return ERROR_NONE;
}

int rr_load() {
    EditSetText(DLG_EX1, "(e.g. slide([0-9]+)\\.jpg )");
    EditSetText(DLG_EX2, "(e.g. image_$1.jpg )");
    return ERROR_NONE;
}

int rr_action(int id, int action) {
    string path;

    switch (id) {
        case DLG_BROWSE:
            path = EditGetText(DLG_LOCATION);
            path = BrowseFolder("Select Location", path);
            if (GetLastError() == ERROR_CANCEL) {
                break;
            }
            EditSetText(DLG_LOCATION, path);
            break;
        case DLG_TEST:
            rr_run_function(false);
            break;
        case DLG_RENAME:
            rr_run_function(true);
            break;
    }
    return ERROR_NONE;
}

#beginresource

RegexRename DIALOGEX 0, 0, 300, 104, 0
EXSTYLE WS_EX_DLGMODALFRAME
STYLE DS_MODALFRAME | DS_3DLOOK | WS_POPUP | WS_VISIBLE | WS_CAPTION | WS_SYSMENU
CAPTION "Rename Files with Regular Expressions"
FONT 8, "MS Shell Dlg"
{
    CONTROL "Options", -1, "static", SS_LEFT | WS_CHILD | WS_VISIBLE, 6, 4, 57, 8, 0
    CONTROL "", -1, "static", SS_ETCHEDFRAME | WS_CHILD | WS_VISIBLE, 36, 9, 260, 1, 0
    CONTROL "&Location:", -1, "static", SS_LEFT | WS_CHILD | WS_VISIBLE, 12, 19, 38, 8, 0
    CONTROL "", DLG_LOCATION, "edit", ES_LEFT | ES_AUTOHSCROLL | WS_CHILD | WS_VISIBLE | WS_BORDER | WS_TABSTOP, 50, 18, 186, 12, 0
    CONTROL "&Browse", DLG_BROWSE, "button", BS_PUSHBUTTON | BS_CENTER | WS_CHILD | WS_VISIBLE | WS_TABSTOP, 243, 18, 45, 12, 0
    CONTROL "Process subdirectories (recurse)", DLG_SUBDIR, "button", BS_AUTOCHECKBOX | WS_CHILD | WS_VISIBLE | WS_TABSTOP, 50, 32, 186, 12, 0
    CONTROL "&Pattern:", -1, "static", SS_LEFT | WS_CHILD | WS_VISIBLE, 12, 49, 38, 8, 0
    CONTROL "", DLG_PATTERN, "edit", ES_LEFT | ES_AUTOHSCROLL | WS_CHILD | WS_VISIBLE | WS_BORDER | WS_TABSTOP, 50, 47, 170, 12, 0
    CONTROL "", DLG_EX1, "static", SS_LEFT | WS_CHILD | WS_VISIBLE, 222, 49, 76, 8, 0
    CONTROL "&Replace:", -1, "static", SS_LEFT | WS_CHILD | WS_VISIBLE, 12, 65, 38, 8, 0
    CONTROL "", DLG_REPLACE, "edit", ES_LEFT | ES_AUTOHSCROLL | WS_CHILD | WS_VISIBLE | WS_BORDER | WS_TABSTOP, 50, 63, 170, 12, 0
    CONTROL "", DLG_EX2, "static", SS_LEFT | WS_CHILD | WS_VISIBLE, 222, 65, 76, 8, 0
    CONTROL "", -1, "static", SS_ETCHEDFRAME | WS_CHILD | WS_VISIBLE, 6, 80, 290, 1, 0
    CONTROL "&Test", DLG_TEST, "button", BS_DEFPUSHBUTTON | BS_CENTER | WS_CHILD | WS_VISIBLE | WS_TABSTOP, 136, 86, 50, 14, 0
    CONTROL "Re&name", DLG_RENAME, "BUTTON", BS_PUSHBUTTON | BS_CENTER | WS_CHILD | WS_VISIBLE | WS_TABSTOP, 191, 86, 50, 14
    CONTROL "Close", IDOK, "BUTTON", BS_PUSHBUTTON | BS_CENTER | WS_CHILD | WS_VISIBLE | WS_TABSTOP, 246, 86, 50, 14
}

#endresource
The Next Step
If you want to test what you’ve learned here and your Legato skills, try taking the above script and editing it to allow for partial matches in filenames.
David Theis has been developing software for Windows operating systems for over fifteen years. He has a Bachelor of Sciences in Computer Science from the Rochester Institute of Technology and co-founded Novaworks in 2006. He is the Vice President of Development and is one of the primary developers of GoFiler, a financial reporting software package designed to create and file EDGAR XML, HTML, and XBRL documents to the U.S. Securities and Exchange Commission.