UCD Interface
Introduction – Technical Information
Regex Engine
The RegexFormat 9 app ( site: www.regexformat.com ) uses the Boost Regex 1.65.1 engine source code, which is compiled into the application's binary.
( site: http://www.boost.org/doc/libs/1_65_1/libs/regex/doc/html/index.html )
The Boost source has been modified to get the effects this application needs.
Externally, this heavily modified engine is available to run (test) regular expressions in a Find / Find-Next / Replace paradigm. It is also used for the Benchmark suite.
The engine is permanently set to Perl mode and uses all of its capabilities.
List of engine modifications:
· Allow Back Reference to undefined groups (not yet parsed).
· Allow Nested Back Reference.
· Correct Non-word boundary construct \B at begin/end of string.
· Full support for all Unicode 12 Properties and Character Names, \p{}, \N{}.
· Show named capture group names in the Match Results output window.
· Parse and handle Property constructs inside classes [..\p{}..\P{}..].
· Single character (non-brace) shortcut abbreviation for GCM properties.
· Correct line break construct \R when parsing in expanded mode.
· Correct intersection results of a negated class with negative classes.
  Example: [^\W\D] will match digits only.
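The negated-class example can be tried in any engine that evaluates negated classes the same way. The snippet below is a minimal sketch that uses Python's built-in re module as a stand-in; it is not the application's modified Boost engine, and the sample string is made up for illustration.

```python
import re

# [^\W\D] excludes every non-word character and every non-digit character,
# so only characters that are both word characters and digits remain.
DIGITS_ONLY = re.compile(r"[^\W\D]")

sample = "a1_ b2-c3"          # hypothetical test text
matches = DIGITS_ONLY.findall(sample)
print(matches)
```

Letters and the underscore are word characters but not digits, so the \D inside the negated class removes them.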
UCD Interface Technical Info
UCD is the acronym for Unicode Character Database.
This interface is used to browse all Unicode 12 Properties and Character names.
( site: The Unicode Consortium - http://unicode.org/ )
The regex engine is built to parse and process a thousand or so Properties and 0x10FFFF character names. Basically, this is everything that’s available using the latest ICU library.
The current version used is the ICU4C 64.2 library (as of this release), which is for Unicode 12 and CLDR 35.
( site: International Components for Unicode - http://site.icu-project.org/ )
How this Interface Works
Properties and/or Character Names can be browsed and added to the Regex Cache list.
Individual list items can be negated or not, and selected (check column) to be used in a Regular Expression. That expression is run against the code point range 0 - 0x10FFFF to get the set of characters that match it.
The output is the result of running the regex with our engine. Since this is the same as running the regex from the testing portion of the application, the construct can be included in a larger regex and will work the same way.
The regex construct handed to the engine is a containing class of the chosen Properties and/or Names. The container class can be negated as well, allowing an extremely precise expression.
String Limitations
The U+000000 code point will never match. The sample target string (Ustring32) is composed of all the characters in the range 0 - 0x10FFFF, but for display purposes NULs are interpreted as the end of the string. Therefore the U+000000 code point is excluded from the string.
Since the target string is UTF-32, all other code points will match. This includes any leading / trailing surrogate code points and any non-characters.
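As a rough illustration of the target string described above, the sketch below builds an equivalent string in Python, whose str type also stores one entry per code point. The function name build_target is hypothetical and this is not the application's Ustring32 code.

```python
def build_target(first=1, last=0x10FFFF):
    """Return a string holding every code point from first to last, inclusive.

    U+0000 is left out by default, mirroring the limitation described above.
    Surrogates and non-characters are kept, as in a UTF-32 target.
    """
    return "".join(map(chr, range(first, last + 1)))

target = build_target()
print(len(target))            # one entry per code point except U+0000
print("\x00" in target)       # NUL is not part of the target
print("\ud800" in target)     # a lone leading surrogate is present
```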
Regex Cache
The regex cache is always visible. It is the list of added Properties and / or Names.
Items are added in their basic form with the corresponding | Add > | button, found at several locations in the pages.
The maximum number of items allowed in the regex cache is 2000.
Columns can be sorted by clicking a field on the header control.
Note that sorting the Property field sorts the raw strings, not the enclosing regex syntax.
Columns 1, 2 and 3 control the item's state
Checking an item in one of these columns will add or remove a state bullet in the column.
Column 1: The check box. A checked item is included in the regular expression.
Column 2: Negate the property. Items can be independently negated.
Column 3: Use POSIX syntax. Typically [: :] or [: ^ :] if negated. The default is \p{ }.
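The effect of columns 2 and 3 on an item's regex text can be sketched as a small function. This is only an illustration: render_item and its parameters are hypothetical names, and the exact spelling the application emits (for example [:^name:] for a negated POSIX item) is assumed from the forms listed above.

```python
def render_item(name, negated=False, posix=False):
    r"""Render one cache item as regex text.

    Default form is \p{name}, or \P{name} when negated.
    POSIX form is [:name:], or [:^name:] when negated (assumed spelling).
    """
    if posix:
        return "[:^%s:]" % name if negated else "[:%s:]" % name
    return "\\P{%s}" % name if negated else "\\p{%s}" % name

print(render_item("Lu"))
print(render_item("Lu", negated=True))
print(render_item("alpha", posix=True))
print(render_item("alpha", negated=True, posix=True))
```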
Column 4 is the only place to change selection / focus
Normal List Control keys apply. Multiple selection is possible.
Note that if multiple items are selected when the check column is clicked, all selected items will change to correspond.
The exception is when the <SHIFT> key is held down: then only that item's check is affected.
Misc Control
Double clicking an item in column 4 will navigate to its source location in the Properties browser or Names browser (see below).
A context menu is available via right button mouse click. Here items can be copied, or their property value / type aliases modified as needed.
The | Delete | and | Modify | buttons contain drop down menus to manipulate the list.
The arrow buttons will move items up or down in the list. This is useful if a desired output is to be copied for use elsewhere.
Properties Page
This page is the browser for all the Properties available (obtained) from the ICU library.
This reflects the Unicode 12 specification. Its usage is intuitive.
These lists contain the raw property Type and Value text. Syntax is added when an item is added to the Regex cache, and its syntax can be manipulated from the cache.
The property Type list is on the left. The property Value list is on the right.
Each item selection also populates an accompanying Alias combo box.
The alias can default to the long or short name using the respective check boxes.
Not all property Types have Value entries (future expansion?).
The list can be filtered to show only Types with values using the filter check box, if desired.
Items can be added to the Regex cache using the | Add > | button. That button contains a drop down menu from which all values can be added (if desired), or multiple values can be selected and then added.
Note that if a Type is added without a value and run as a regex, a construct error message will pop up. This causes no harm; simply un-check the item in the cache.
Search Input Box
This is used to search all Property Types and Value strings.
A separate modeless list dialog will pop out to show the results.
To see all Types and Values (about 979), enter the special string <all>.
Properties can be added to the Cache using the | Add > | button, and multiple items are selectable. Double clicking an item will go to its source in the Properties browser.
Codepoints Page
The Codepoints page is the place to get output from the constructed regular expression.
The buttons | Get Characters |, | Get Hex Conversion | and | Get Character Names |, and each option in their drop down menus, all use the check marked Regex cache items to construct a regular expression object that runs on the codepoint string (but see String Limitations above).
The regex is constructed inside a character class that itself can be negated using the Negate Container check box.
Finally, a positive quantifier is added to the container class. The regex is run in a global fashion on the codepoint string, and all matches are joined into a string that is displayed in the result window.
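The construction just described can be sketched as follows. This is an assumption-laden illustration, not the application's code: it reads "positive quantifier" as +, uses Python's built-in re module (which has no \p{} support, so stand-in fragments such as \d are used), and runs on an ASCII slice instead of the full codepoint string. The names build_container and run_global are hypothetical.

```python
import re

def build_container(items, negate_container=False):
    """Wrap the checked items in one character class and add a + quantifier.

    items: iterable of (fragment, checked) pairs, fragment being class-body text.
    """
    body = "".join(fragment for fragment, checked in items if checked)
    if not body:
        raise ValueError("no checked items")
    return "[%s%s]+" % ("^" if negate_container else "", body)

def run_global(pattern, target):
    """Run the regex over the whole target and join every match."""
    return "".join(m.group(0) for m in re.finditer(pattern, target))

# Stand-in fragments that the stdlib re module understands.
cache = [(r"\d", True), (r"a-f", True), (r"\s", False)]
target = "".join(map(chr, range(1, 0x80)))   # ASCII slice of the codepoint string

pattern = build_container(cache)
print(pattern)
print(run_global(pattern, target))
```

Passing negate_container=True produces the complementary set over the same target, which is what the Negate Container check box is described as doing.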
The Get Character Names button performs a special function.
Using the regex match output, it runs a query through the Names page that results in highlighting just those names ( see Names Page section below ).
Custom-Rx Page
The Custom-Rx page is similar in function to the Codepoints page.
The difference is that on this page, the Regex Object is editable.
Mix, add and extend syntax and constructs here to get the desired output.
Available regex flags are Expanded and Dot-All.
Checked properties from the Regex cache can be added with the | Add Properties | button, or can be pasted as desired.
Notes :
If the regex does not consume any characters, for example (?= . ) , no codepoints will be matched. Something must be consumed.
Also, only group 0 ( the overall codepoint matches ) will be displayed in the output.
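Both notes can be demonstrated with a short sketch. It uses Python's re module with the VERBOSE flag as a stand-in for the application's Expanded mode; the helper name joined_group0 and the sample target are hypothetical.

```python
import re

def joined_group0(pattern, target, flags=0):
    """Join group 0 of every match, which is all the output shows."""
    return "".join(m.group(0) for m in re.finditer(pattern, target, flags))

target = "abc123"

# A lookahead alone matches at every position but consumes nothing.
empty = joined_group0(r"(?= . )", target, re.VERBOSE)

# Consuming one character after the lookahead yields output; the capture
# in group 1 is ignored because only group 0 is joined.
digits = joined_group0(r"(?= (\d) ) .", target, re.VERBOSE)

print(repr(empty), repr(digits))
```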
Unique Page
The Unique page is similar in function to the Codepoints and Custom-Rx pages.
The difference is that on this page, there is no Regex Object.
Instead, paste or type a string into the Input box.
When the | Get Unique Codepoints | button is pressed, the resulting output displays the filtered unique characters of that string.
At this point, the result can be examined with the same features available in the Codepoints or Custom-Rx pages.
Notes :
When the input string is examined, any lone surrogates will be skipped, as they are not valid in the UTF-32 context used. A message will be displayed if this occurs.
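A minimal sketch of this filtering, assuming the output is ordered by ascending code point (the ordering the application uses is not stated here). The function name unique_codepoints is hypothetical.

```python
def unique_codepoints(text):
    """Return (unique characters, number of lone surrogates skipped).

    Characters are returned in ascending code point order; the ordering is an
    assumption of this sketch. In a Python str every surrogate is a lone one.
    """
    surrogates = [ch for ch in text if 0xD800 <= ord(ch) <= 0xDFFF]
    kept = sorted(set(text) - set(surrogates))
    return "".join(kept), len(surrogates)

chars, skipped = unique_codepoints("banana\ud800 split")
print(repr(chars), skipped)
```

A non-zero skipped count corresponds to the situation in which the application shows its message.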
Names Page
The Names page has a dual functionality.
First and foremost, it is a browser for all Character Names available (obtained) from the ICU library. This reflects the Unicode 12 specification.
Second, it can overlay a highlighted set of items from the result string of a Codepoints search by clicking the Get Character Names button on that page ( see Codepoints Page section above ).
Controlling the Names list box:
Note - this is a virtual list of code points 0 - 0x10FFFF.
The window displays 16 code point names in a fixed page. It scrolls in the normal fashion, so it is otherwise imperceptible that it is a virtual list.
Many controls are on this page to help with navigation.
Many controls are on this page to help with navigation.
Sliders - Each slider scrolls the name list (up or down) through the value range of its respective byte position.
Example ( in hex terms ):
0x 10 00 00   Left slider controls this range
0x 00 FF 00   Middle slider controls this range
0x 00 00 FF   Right slider controls this range
Click on a slider, drag it or use the mouse scroll wheel.
If the mouse wheel is used, the range will wrap, ticking the adjacent slider as it goes.
This is a fast way to get to a certain place while browsing.
Note - if the <SHIFT> key is held down while wheel scrolling, the slider position increments by 1, otherwise it increments by 16. This is a way to zoom in on a location.
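The byte-position arithmetic behind the sliders can be sketched like this. It is an interpretation rather than the application's code: the names slider_positions and wheel are hypothetical, and clamping at the ends of the range is an assumption.

```python
MAX_CP = 0x10FFFF

def slider_positions(cp):
    """Split a code point into (left, middle, right) slider values."""
    return (cp >> 16) & 0xFF, (cp >> 8) & 0xFF, cp & 0xFF

def wheel(cp, slider, ticks, shift=False):
    """Move the top index by wheel ticks on one slider.

    slider: 0 = left, 1 = middle, 2 = right. Each tick moves that slider by
    16 positions, or by 1 when <SHIFT> is held. Overflow carries into the
    adjacent slider because the sum is done on the whole code point.
    The result is clamped to 0 - 0x10FFFF (an assumption of this sketch).
    """
    unit = 1 << (8 * (2 - slider))          # weight of one slider position
    step = (1 if shift else 16) * unit
    return max(0, min(MAX_CP, cp + ticks * step))

top = wheel(0x00F8, slider=2, ticks=1)      # right slider wraps past 0xFF
print(hex(top), slider_positions(top))
```

Here one wheel tick on the right slider moves it past 0xFF, so the middle slider advances by one position.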
Mouse Wheel Scroll – If the focus is NOT on a slider, a font combobox, or an abbreviation combobox, the name list will respond to the wheel scroll.
This is useful when using the Next / Prev highlight buttons.
For example, the focus stays on those buttons until another control is clicked OR the mouse wheel is engaged.
The mouse can hover over the buttons and scrolling can be done without having to click the list to change focus. Multiple selection can be done while navigating with the buttons. See Query Names below.
List by Name
Viewing the list defaults to a Codepoint sort.
To view the list as a Name sort, click the List by Name check box.
When this is checked for the first time, an index table is built, which takes about 10 seconds. After that, a switch between sorts is instantaneous.
Note that the top visible item in the current page is preserved when switching sorts.
To locate an item across the sorted views, place it at the top of the visible list before switching. When in this mode, the location text box displays a decimal number to indicate the top index’s place in the list, since it is now being viewed alphabetically.
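The idea of a one-time name index can be sketched with Python's unicodedata module, which may carry a different Unicode version than the application's ICU library. The function names are hypothetical, and only three code points are indexed to keep the example quick.

```python
import unicodedata

def build_name_index(codepoints):
    """Return the code points ordered by character name (unnamed ones first)."""
    return sorted(codepoints, key=lambda cp: (unicodedata.name(chr(cp), ""), cp))

def switch_view(top_cp, order):
    """Return the list position that keeps top_cp at the top after a switch."""
    return order.index(top_cp)

by_name = build_name_index([0x21, 0x31, 0x41])
print([unicodedata.name(chr(cp)) for cp in by_name])
print(switch_view(0x41, by_name))   # decimal place of the top item in the name sort
```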
Abbreviation Combo Box
When in List by Name mode, a combo box becomes visible that contains a list of three-letter items that correspond to the text in the name list. These are jump points that move the top index to that unique point.
If the edit portion is selected (focused), the mouse wheel will scroll through the entries without the combo drop-down. Any change in the combo’s selection moves the name list to that point.
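One way to derive such jump points from a name-sorted list is sketched below. The function name jump_points and the sample names are illustrative only; how the application builds its list is not stated here.

```python
def jump_points(sorted_names):
    """Map each three-letter prefix to the first index where it appears."""
    points = {}
    for index, name in enumerate(sorted_names):
        points.setdefault(name[:3], index)   # keep only the first occurrence
    return points

names = ["DIGIT ONE", "DIGIT TWO", "EXCLAMATION MARK", "LATIN CAPITAL LETTER A"]
print(jump_points(names))
```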
Setting the Character Field Font
The font of the character field (2) is dynamically controlled by the font combo box.
The application selects Arial Unicode MS by default when installed ( if detected ).
The currently selected font will be preserved between application sessions.
Add Button
Press this button to add selected items to the regex cache.
Note - this will be in the form of the standard \N{name} construct and can't be negated.
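For illustration, the \N{name} text for a character can be produced and resolved with Python's unicodedata module. This is a sketch with hypothetical helper names; the application takes its names from ICU, so names for recently added characters may differ.

```python
import unicodedata

def name_construct(char):
    r"""Return the \N{name} text for one character."""
    return "\\N{%s}" % unicodedata.name(char)

def resolve(construct):
    r"""Turn \N{name} text back into the character it names."""
    return unicodedata.lookup(construct[3:-1])   # strip the \N{ and } wrapper

construct = name_construct("a")
print(construct)
print(resolve(construct))
```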
Query Names
The Get Character Names button from the Codepoints Page sends a matched character string to this page. A highlight index of those characters is created (instantly), then the items are highlighted.
To turn the highlight off, check the Hide Query check box. The highlight returns when it is un-checked.
To navigate the highlighted set, use the Next / Prev highlight buttons.
As a navigational aid, the up or down buttons stop on the first and last item of a continuous block. This makes it easy to quickly detect blocks. The highlight is maintained in both sort modes, so it's easy to view the alphabetic / codepoint relationship of the set.
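The block-stop behaviour can be modelled by grouping the highlighted code points into contiguous runs. The sketch below is an interpretation in codepoint order only, with hypothetical function names, and is not the application's highlight index.

```python
def highlight_blocks(codepoints):
    """Group highlighted code points into (first, last) contiguous blocks."""
    blocks = []
    for cp in sorted(set(codepoints)):
        if blocks and cp == blocks[-1][1] + 1:
            blocks[-1] = (blocks[-1][0], cp)     # extend the current block
        else:
            blocks.append((cp, cp))              # start a new block
    return blocks

def stops(blocks):
    """List the items the Next / Prev buttons would stop on, in order."""
    result = []
    for first, last in blocks:
        result.append(first)
        if last != first:
            result.append(last)
    return result

blocks = highlight_blocks(map(ord, "0123456789abcdef"))
print(blocks)
print([chr(cp) for cp in stops(blocks)])
```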
These buttons will auto-repeat when held down with either the left mouse button or the keyboard Enter key.
The mouse repeat rate is set at about 150 ms for slow viewing.
In contrast, the keyboard rate will go as fast as your hardware allows ( 30 ms ? ).
Query Info
The query info rectangle is a static control that monitors mouse-over to show a Tool Tip. The Tool Tip displays the Regex used to obtain the highlight set, the number of highlighted items, and its state ( visible or hidden ).
Context Menu
Right click the names list to get a context menu.
There you can copy either selected items or highlighted items in different formats.
You can then paste them anywhere needed.
Note that copying the entire names list will temporarily consume about 35 megabytes and takes about 3 seconds.
The Floating Name Page
There is a button on the top right of the page with a picture of a Tab and arrows pointing either towards or away from it.
Press this button to open or hide a secondary floating Names window.
This is a modeless dialog with the same functionality as the page, and is especially useful when doing side-by-side comparisons of different query sets and with the Property page tables.
The default behavior of a names Query is to open and show the results in the floating window. This behavior can be changed by selecting the appropriate Query action on the Usage page ( this page ).
Alternatively, the | Get Character Names | button on the Codepoints page contains a dropdown menu where the Query output can be explicitly specified.
There is also a Sync button on the top right, next to the hide / open button.
This button will make the other Names window’s top index align with the top index of the page from which it was pressed. This is useful when doing a side-by-side comparison of an alphabetical listing with a codepoint listing.