UCD Interface

Introduction – Technical Information

Regex Engine

The RegexFormat 9 app ( site: www.regexformat.com ) uses the Boost Regex 1.65.1 engine source code, which is compiled into this application's binary.

( site: http://www.boost.org/doc/libs/1_65_1/libs/regex/doc/html/index.html )

The Boost source has been modified to get the effects this application needs.

Externally, this heavily modified engine is available to run (test) regular expressions

in a Find / Find-Next / Replace paradigm. It is also used for the Benchmark suite.

The engine usage is permanently set to Perl mode and uses all of its capabilities.

List of engine modifications:

· Allow Back Reference to undefined groups (not yet parsed).

· Allow Nested Back Reference.

· Correct Non-word boundary construct \B, at begin/end of string.

· Full support for all Unicode 12 Properties and Character Names, \p{}, \N{}.

· Show named capture group names in the Match Results output window.

· Parse and handle Property constructs inside classes [..\p{}..\P{}..]

· Single character (non-brace) shortcut abbreviation for GCM properties.

· Correct line break construct \R when parsing in expanded mode.

· Correct intersection results of a negated class with negative classes.

Example: [^\W\D] will match digits only.
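
The intersection rule in the last item can be checked with any engine that accepts shorthand classes inside a negated class. The sketch below uses Python's standard re module purely as a stand-in (it is not the modified Boost engine), and the sample string and the helper name digits_in are made up for illustration.

```python
import re

# [^\W\D] excludes every non-word character and every non-digit,
# so only characters that are both word characters and digits remain.
DIGITS_ONLY = re.compile(r"[^\W\D]")


def digits_in(text: str) -> list:
    """Return each character of `text` matched by the negated class."""
    return DIGITS_ONLY.findall(text)


if __name__ == "__main__":
    # The underscore is a word character but not a digit, so it is dropped.
    print(digits_in("a1_b2 c-3"))
```

Letters, the underscore, spaces and punctuation are all rejected here; only the three digits survive.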

UCD Interface Technical Info

UCD is the acronym for Unicode Character Database.

This interface is used to browse all Unicode 12 Properties and Character names.

( site: The Unicode Consortium - http://unicode.org/ )

The regex engine is built to parse and process a thousand or so Properties and the character names across the code point range 0 - 0x10FFFF. Basically, all that’s available using the latest ICU library.

The current version used (as of this release) is the ICU4C 64.2 library, which is for Unicode 12 and CLDR 35.

( site: International Components for Unicode - http://site.icu-project.org/ )

How this Interface Works

Properties and/or Character Names can be browsed and added to the Regex Cache list.

Individual list items can be negated or not and selected (check column) to be used

in a Regular Expression that is run against the code point range 0 - 0x10FFFF to get

a set of characters that match that regex.

The output is the result of running the regex with our engine. Since this is the same as

running the regex from the testing portion of the application, it can be used, and

included in a larger regex, and it works the same way.

The regex Construct handed to the engine, is a containing class of the chosen Properties

and/or Names. The container class can be negated as well, allowing an extremely

precise expression.

String Limitations

The U+0000 code point will never match. The sample target string (Ustring32) is composed of all the characters in the range 0 - 0x10FFFF, but for display purposes a NUL is interpreted as the end of the string.

Therefore the U+0000 code point is excluded from the string.

Since the target string is UTF-32, all other code points will match.

This includes any leading / trailing surrogate code points and any non-characters.
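
As a rough illustration of the target string described above, the sketch below builds the same range in Python. A Python str stands in for the application's Ustring32 (an assumption, since the internal type is not documented here), and build_target is a hypothetical helper name.

```python
def build_target(first: int = 1, last: int = 0x10FFFF) -> str:
    """Every code point from `first` to `last`, one character each.

    Starting at 1 mirrors the exclusion of U+0000 from the target string.
    """
    return "".join(map(chr, range(first, last + 1)))


if __name__ == "__main__":
    target = build_target()
    print(len(target) == 0x10FFFF)   # every code point except U+0000
    print("\x00" in target)          # the NUL code point is absent
    print("\ud800" in target)        # lone surrogates are present
    print("\uffff" in target)        # non-characters are present
```

Because each element is a full code point rather than a UTF-16 unit, surrogates and non-characters are ordinary members of the string and can be matched like anything else.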

Regex Cache

The regex cache is always visible. It is the list of added Properties and / or Names.

Items are added in their basic form using the corresponding | Add > | button, which appears at several locations in the pages.

The maximum number of items allowed in the regex cache is 2000.

Columns can be sorted by clicking a field on the header control.

Note that sorting the Property field sorts the raw strings, not the enclosing regex syntax.

Columns 1, 2 and 3 control the item's state

Clicking an item in one of these columns will add or remove a state bullet in the column.

Column 1: The check box. A checked item is included in the regular expression.

Column 2: Negate the property. Items can be independently negated.

Column 3: Use POSIX syntax. Typically [: :] or [: ^ :] if negated. The default is \p{ }.
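
The three state columns can be pictured as flags on each list item. The sketch below is only a model of that idea: the CacheItem class and render function are hypothetical names, and the negated POSIX form is assumed to be written [:^name:].

```python
from dataclasses import dataclass


@dataclass
class CacheItem:
    """Hypothetical model of one Regex Cache row."""
    text: str              # raw property text, for example "Lu"
    checked: bool = True   # column 1: include in the regular expression
    negated: bool = False  # column 2: negate this property
    posix: bool = False    # column 3: use POSIX syntax instead of \p{ }


def render(item: CacheItem) -> str:
    """Return the regex syntax for one item according to its state."""
    if item.posix:
        # Assumed spelling of the negated POSIX form: [:^name:]
        return "[:" + ("^" if item.negated else "") + item.text + ":]"
    return "\\" + ("P" if item.negated else "p") + "{" + item.text + "}"


if __name__ == "__main__":
    for row in (CacheItem("Lu"), CacheItem("Lu", negated=True),
                CacheItem("Lu", posix=True),
                CacheItem("Lu", negated=True, posix=True)):
        print(render(row))
```

Keeping the raw text separate from the rendered syntax matches the note above that sorting works on the raw strings.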

Column 4 is the only place to change selection / focus

Normal List Control keys apply. Multiple selection is possible.

Note that if multiple items are selected when the check column is clicked, all selected items change to correspond. The exception is when the <SHIFT> key is held down: then only the clicked item's check is affected.

Misc Control

Double clicking an item in column 4 will navigate to its source location.

A context menu is available via a right mouse button click. Here items can be copied, or their property value / type aliases modified as needed.

The | Delete | and | Modify | buttons contain drop-down menus to manipulate the list.

The arrow buttons will move items up or down in the list. This is useful if a desired output

is to be copied for use elsewhere.

Double clicking an item will go to its source in the Properties browser or Names browser

(see below).

Properties Page

This page is the browser for all the Properties available (obtained) from the ICU library.

This reflects the Unicode 12 specification. Its usage is intuitive.

These lists contain the Raw property Type and Values text. Syntax is added when added

to the Regex cache, and its syntax can be manipulated from the cache.

The property Type list is on the left. The property Value list is on the right.

Each item selection also populates an accompanying Alias combo box.

The alias can default to the long or short name using the respective check boxes.

Not all property Types have Value entries (future expansion?).

The list can be limited to Types that have values using the filter check box if desired.

Items can be added to the Regex cache using the | Add > | button. That button contains a

drop down menu where all values can be added (if desired) or multiple values can be

selected, then added.

Note that if a Type is added without a value and run as a regex, a construct error message will pop up. This causes no harm; simply un-check the item in the cache.

Search Input Box

This is used to search all Property Types and Value strings.

A separate modeless list dialog will pop out to show the results.

To see all Types and Values (about 979) enter the special string <all>.

Properties can be added to the Cache using the | Add > | button and multiple items are

selectable. Double clicking an item will go to its source in the Properties browser.

Codepoints Page

The Codepoints page is the place to get output from the constructed regular expression.

The buttons on this page, and each option in their drop-down menus, all use the checked Regex cache items to construct a regular expression object that runs on the codepoint string (but see String Limitations above).

The regex is constructed inside a character class that itself can be negated using the

Negate Container check box.

Finally, a positive quantifier is added to the container class. The regex is run globally on the codepoint string, and all matches are joined into a single string that is displayed in the result window.
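
The construct-and-run steps above can be approximated outside the application. Python's re module has no \p{ } support, so the sketch below emulates the container class with unicodedata.category and handles General Category values only. The function names are hypothetical, the quantifier is assumed to be +, and Python's Unicode tables may differ from the ICU data the application uses.

```python
import unicodedata


def container_pattern(categories, negate_container=False) -> str:
    """Text of the container class, with an assumed + quantifier."""
    body = "".join("\\p{" + c + "}" for c in categories)
    return "[" + ("^" if negate_container else "") + body + "]+"


def run_container(categories, negate_container=False,
                  first=1, last=0x10FFFF) -> str:
    """Emulate running the class globally and joining every match.

    Joining all matches of a class followed by + gives the same string
    as concatenating every single character that is in the class.
    """
    wanted = set(categories)
    out = []
    for cp in range(first, last + 1):
        in_class = unicodedata.category(chr(cp)) in wanted
        if in_class != negate_container:
            out.append(chr(cp))
    return "".join(out)


if __name__ == "__main__":
    print(container_pattern(["Nd"]))
    print(run_container(["Nd"], first=0x20, last=0x7E))
```

Negating the container simply flips the membership test, which is why a negated container over a small class returns nearly the whole range.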

The Get Character Names button performs a special function.

Using the regex match output, it runs a query through the Names page that results

in highlighting just those names ( see below Names Page section ).
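
Conceptually, the query turns each matched character back into its code point and name. A minimal sketch of that lookup follows, assuming Python's unicodedata module as a stand-in for the ICU name tables; names_for and the "<no name>" placeholder are hypothetical.

```python
import unicodedata


def names_for(matched: str) -> list:
    """Pair each matched character with its U+XXXX label and character name."""
    rows = []
    for ch in matched:
        # Code points without a name get a placeholder instead of an error.
        name = unicodedata.name(ch, "<no name>")
        rows.append(("U+%04X" % ord(ch), name))
    return rows


if __name__ == "__main__":
    for label, name in names_for("A\u03a9"):
        print(label, name)
```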

Custom-Rx Page

The Custom-Rx page is similar in function to the Codepoints page.

The difference is that on this page, the Regex Object is editable.

Mix, add and extend syntax and constructs here to get the desired output.

Available regex flags are Expanded and Dot-All.

Checked properties from the Regex cache can be added with the | Add Properties |

button, or can be pasted as desired.

Notes :

If the regex does not consume any characters, for example by using

this (?= . ) , no codepoints will be matched. Something must be consumed.

Also, only group 0 ( the overall codepoint matches ) will be displayed in the output.
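
The first note is easy to reproduce with any engine. The sketch below uses Python's re module as a stand-in for the application's engine, with a made-up three-character sample: a lookahead alone produces only empty matches, so the joined output is empty.

```python
import re


def joined_matches(pattern: str, text: str) -> str:
    """Run `pattern` globally over `text` and join group 0 of every match."""
    return "".join(m.group(0) for m in re.finditer(pattern, text, re.DOTALL))


if __name__ == "__main__":
    sample = "abc"
    print(repr(joined_matches(r"(?=.)", sample)))    # '' (nothing consumed)
    print(repr(joined_matches(r"(?=.).", sample)))   # 'abc' (one character consumed per match)
```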

Unique Page

The Unique page is similar in function to the Codepoints and Custom-Rx pages.

The difference is that this page has no Regex Object.

Instead, paste or type a string into the Input box.

When the | Get Unique Codepoints | button is pressed, the resulting output displays the filtered unique characters of that string.

At this point, the result can be examined with the same features available in the

Codepoints or Custom-Rx pages.

Notes :

When the input string is examined, any lone surrogates are skipped because they are not valid in the UTF-32 context used. A message will be displayed if this occurs.
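
A minimal sketch of this filtering follows. The output order is not stated on this page, so sorting by code point is an assumption, as is the helper name unique_codepoints. In a Python str, a well-formed surrogate pair is already a single code point, so any surrogate that remains is a lone one.

```python
def unique_codepoints(text: str):
    """Return (unique characters sorted by code point, lone surrogates skipped)."""
    seen = set()
    skipped = 0
    for ch in text:
        if 0xD800 <= ord(ch) <= 0xDFFF:
            skipped += 1      # lone surrogate: not valid, leave it out
            continue
        seen.add(ch)
    return "".join(sorted(seen)), skipped


if __name__ == "__main__":
    chars, skipped = unique_codepoints("banana\ud800")
    print(chars)
    if skipped:
        print("skipped lone surrogates:", skipped)
```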

Names Page

The Names page has two functions.

First and foremost, it is a browser for all Character Names available (obtained) from the ICU library. This reflects the Unicode 12 specification.

Second, it can overlay a highlighted set of items from the result string of a Codepoints

search by clicking the Get Character Names button on that page

( see above Codepoints Page section ) .

Controlling the Names list box:

Note - this is a virtual list of code points 0 - 0x10FFFF.

The window displays 16 code point names in a fixed page. It scrolls in the normal fashion, so the fact that it is a virtual list is otherwise imperceptible.

Many controls are on this page to help with navigation.

Sliders - Each slider scrolls the name list (up or down) through the value range of its respective byte position.

Example ( in hex terms ):

0x 10 00 00 Left slider controls this range

0x 00 FF 00 Middle slider controls this range

0x 00 00 FF Right slider controls this range

Click on a slider, drag it or use the mouse scroll wheel.

If the mouse wheel is used, the range will wrap, ticking the adjacent slider as it goes.

This is a fast way to get to a certain place while browsing.

Note - if the <SHIFT> key is held down while wheel scrolling, the slider position increments

by 1, otherwise it increments by 16. This is a way to zoom in on a location.
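
The slider layout amounts to splitting the top code point into three byte positions. The sketch below models that, along with wheel scrolling with carry into the adjacent slider. The function names and the clamping at both ends of the range are assumptions, not documented behavior.

```python
MAX_CP = 0x10FFFF


def sliders_for(cp: int):
    """Split a code point into (left, middle, right) slider positions."""
    return (cp >> 16) & 0xFF, (cp >> 8) & 0xFF, cp & 0xFF


def wheel(cp: int, slider: int, ticks: int, shift: bool = False) -> int:
    """Scroll one slider (0 = left, 1 = middle, 2 = right) by wheel ticks.

    Each tick moves the slider 16 positions, or 1 with SHIFT held.
    Overflow carries into the adjacent slider through plain addition,
    and the result is clamped to 0 - 0x10FFFF (an assumption).
    """
    step = 1 if shift else 16
    weight = 1 << (8 * (2 - slider))   # value of one position on this slider
    return max(0, min(MAX_CP, cp + ticks * step * weight))


if __name__ == "__main__":
    cp = wheel(0x0000F8, slider=2, ticks=1)
    print(hex(cp), sliders_for(cp))
```

Starting at 0x0000F8, one tick on the right slider wraps it to 0x08 and ticks the middle slider to 0x01.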

Mouse Wheel Scroll – If the focus is NOT on a slider, a font combobox,

or an abbreviation combobox, the name list will respond to the wheel scroll.

Useful when using the Next / Prev highlight buttons.

For example, the focus stays on those buttons until another control is clicked OR

the mouse wheel is engaged.

The mouse position can hover over the buttons and scrolling can be done without

having to click the list to change focus. Multiple selection can be done while navigating

with the buttons. See Query Names below.

List by Name

Viewing the list defaults to a Codepoint sort.

To view the list as a Name sort, click the List by Name check box.

When this is checked for the first time, an index table is initially built which

takes about 10 seconds. After that, a switch between sorts is instantaneous.

Note that the top visible item in the current page is preserved when switching sorts.

If you want to locate an item between sorted views, it should be located at the top

of the visible list. When in this mode, the location text box displays a decimal number

to indicate the top index’s place in the list, since it is now being viewed alphabetically.
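
The name sort can be thought of as an index table from list position to code point. The following is a rough sketch, assuming Python's unicodedata names in place of ICU's and leaving out code points that have no name (the application's handling of those is not described here). Both function names are hypothetical.

```python
import unicodedata


def build_name_index(first: int = 0, last: int = 0x10FFFF) -> list:
    """Code points in [first, last] that have a name, ordered by that name."""
    named = []
    for cp in range(first, last + 1):
        name = unicodedata.name(chr(cp), "")
        if name:
            named.append((name, cp))
    named.sort()
    return [cp for _, cp in named]


def top_index_after_switch(index: list, top_cp: int) -> int:
    """Decimal position of the current top item in the name-sorted list."""
    return index.index(top_cp)


if __name__ == "__main__":
    digits = build_name_index(0x30, 0x39)
    print([chr(cp) for cp in digits])
    print(top_index_after_switch(digits, 0x31))
```

Building the table once and reusing it is what makes later switches between the two sorts cheap.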

Abbreviation Combo Box

When in List by Name mode, a combo box becomes visible that contains a list

of three-letter items that correspond to the text in the name list. These are

jump points that move the top index to that unique point.

If the edit portion is selected (focused) the mouse wheel will scroll through the

entries without the combo drop-down. Any change in the combo’s selection moves

the name list to that point.
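
One plausible way to derive such jump points is to record the first list position of every distinct three-letter prefix. The sketch below shows the idea on a handful of names; jump_points is a hypothetical helper, and the application's actual table may be built differently.

```python
def jump_points(sorted_names: list) -> dict:
    """Map each distinct three-letter prefix to the index of its first name."""
    jumps = {}
    for i, name in enumerate(sorted_names):
        # setdefault keeps the first (lowest) index seen for each prefix.
        jumps.setdefault(name[:3], i)
    return jumps


if __name__ == "__main__":
    names = [
        "DIGIT EIGHT",
        "DIGIT FIVE",
        "LATIN CAPITAL LETTER A",
        "LATIN SMALL LETTER A",
        "LEFT PARENTHESIS",
    ]
    print(jump_points(names))
```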

Setting the Character Field Font

The font of the character field (2) is dynamically controlled by the font combo box.

The application selects Arial Unicode MS by default when installed ( if detected ).

The current selected font will be preserved between application sessions.

Add Button

Press this button to add selected items to the regex cache.

Note - this will be in the form of the standard \N{name} construct and can't be negated.
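
For illustration, the same \N{name} form can be produced and tried in Python, whose re module accepts it from version 3.8 onward. This is a stand-in for the application's engine, and name_construct is a hypothetical helper.

```python
import re
import unicodedata


def name_construct(ch: str) -> str:
    """Return the \\N{name} construct for a single named character."""
    return "\\N{" + unicodedata.name(ch) + "}"


if __name__ == "__main__":
    pattern = name_construct("a")
    print(pattern)
    print(re.fullmatch(pattern, "a") is not None)
```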

Query Names

The Get Character Names button from the Codepoints Page sends a matched character

string to this page. A highlight index of those characters is created (instantly) then the

items are highlighted.

To turn the highlight off, press the Hide Query check box. The highlight returns when

it is un-checked.

To navigate the highlighted set use the Next / Prev highlight buttons.

As a navigational aid, the up or down buttons stop on the first and last item of a

continuous block. This makes it easy to quickly detect blocks. Highlight is maintained in

the sort modes, so it's easy to view the alphabetic / codepoint relationship of the set.
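
The stop-at-block-edges behavior can be sketched as follows. This version works on code point order only (in name-sorted view the runs would be runs of list positions instead), and all function names are hypothetical.

```python
from typing import List, Optional


def block_stops(highlighted: List[int]) -> List[int]:
    """First and last code point of every run of consecutive highlighted items."""
    cps = sorted(set(highlighted))
    stops = []
    for i, cp in enumerate(cps):
        starts_run = i == 0 or cps[i - 1] != cp - 1
        ends_run = i == len(cps) - 1 or cps[i + 1] != cp + 1
        if starts_run or ends_run:
            stops.append(cp)
    return stops


def next_stop(stops: List[int], current: int) -> Optional[int]:
    """Stop reached by the Next button from `current`, or None at the end."""
    return next((cp for cp in stops if cp > current), None)


def prev_stop(stops: List[int], current: int) -> Optional[int]:
    """Stop reached by the Prev button from `current`, or None at the start."""
    return next((cp for cp in reversed(stops) if cp < current), None)


if __name__ == "__main__":
    stops = block_stops(list(range(0x30, 0x3A)) + [0x41, 0x61, 0x62])
    print([hex(cp) for cp in stops])
    print(hex(next_stop(stops, 0x30)))
```

A single isolated item is both the first and last item of its block, so it appears once in the stop list.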

These buttons will auto-repeat by either left mouse button or keyboard enter key.

The mouse repeat rate is set at about 150 ms for slow viewing.

In contrast, the keyboard rate will go as fast as your hardware allows ( possibly around 30 ms ).

Query Info

The query info rectangle is a static control. Hovering the mouse over it shows a Tool Tip that displays the Regex used to obtain the highlight set, the number of highlighted items, and its state ( visible or hidden ).

Context Menu

Right click the names list to get a context menu.

There you can copy either selected items or highlighted items in different formats.

You can then paste it to anywhere needed.

Note that copying the entire names list will temporarily consume about 35 megabytes

and takes about 3 seconds.

The Floating Name Page

There is a button on the top right of the page with a picture of a Tab and arrows pointing

either towards or away from it.

Press this button to open or hide a secondary floating Names window.

This is a modeless dialog with the same functionality as the page, and is especially useful when doing side-by-side comparisons of different query sets and with the Property page tables.

The default behavior of a names Query is to open and show the results in the floating

window. This behavior can be changed by selecting the appropriate Query action

on the Usage page ( this ).

Or, the | Get Character Names | button on the Codepoints page contains a dropdown

menu where the Query output can be explicitly specified.

There is also a Sync button on the top right, next to the hide / open button.

This button will make the other Name window’s top index align with the top index of

the page from which it was pressed. This is useful when doing a side-by-side

comparison of an alphabetical with a codepoint listing.