Globalyzer Server and Rule Sets
Contents
- 1 How do I get started with Rule Sets?
- 2 Can I modify the Default Rule Sets?
- 3 In what order are Rule Set detection and filtering applied?
- 4 I changed the Rule Set but the Workbench keeps using the old Rule Set. How can I use the updated Rule Set?
- 5 We have lots of Rule Sets: Is there a way to organize Rule Set to better manage them?
- 6 Where can I find help on “General Pattern” issues found in C++ code scanning?
- 7 Does Globalyzer fix JavaScript locale-sensitive method issues?
- 8 What is the fix for the JavaScript locale-sensitive method charAt()?
- 9 How do I add new JavaScript locale-sensitive methods or modify the description and help for existing methods?
- 10 When you create your rule set, can you specify the file extensions you would like scanned?
- 11 How do I create a rule set for the 'C' language?
- 12 Should we always use Microsoft-specific "C++ Generic" mappings for data types, routines, and other objects?
- 13 How to define a regex pattern in a rule set
- 14 How do I create a regex pattern for Registered, Trademark, Copyright symbols?
- 15 Searching for characters in a specified language
- 16 Searching for characters in a specified language with character ranges
- 17 Proper String Formatting for Internationalization
- 18 Identifying Class or Variable Type in Java Rule Sets
- 19 Clean User History
How do I get started with Rule Sets?
If you are using Lingoport's hosted Globalyzer Server, log into https://www.globalyzer.com/gzserver/help/usersguide/ruleSets.html to get started with rule sets.
If the server is hosted on site: https://<your server URL>/help/usersguide/ruleSets.html
Can I modify the Default Rule Sets?
If you are hosting the Globalyzer Server, the Administrator user can modify any of the Default Rule Sets.
If Lingoport is hosting the Globalyzer Server, you will not be able to modify the Default Rule Sets, but you can modify one of your own Rule Sets and then allow your team members to either use it (sharing) or copy it. This Rule Set can then be the starting point for their internationalization scanning and filtering process.
In what order are Rule Set detection and filtering applied?
There are four categories of results:
- Embedded Strings
- Locale-Sensitive Methods
- General Patterns
- Static File References
For Embedded Strings,
- All strings are found as issues initially, then
- The string filters are run (all except String Method Filter) to see what should be filtered
- String Line Filters
- String Content Filters
- String Statement Filters
- String Variable Filters
- The retention rules are run to see what should be retained
- String Method Patterns
- String Variable Patterns
- String Content Patterns
- String Method Filters are run last. We used to run String Method Filters along with the other filters, but found in practice that they should trump all detections and therefore be run last.
For the other three categories, the patterns are run first to find the issues, and then the filters are run to remove unwanted results.
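The ordering for Embedded Strings can be pictured with a small Java sketch. This is not Globalyzer's implementation: the class name, the scan method, and the idea of modelling each rule group as a simple predicate on the string are all assumptions made for illustration. It only shows why running the String Method Filters last lets them override a retention rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the evaluation order only; not Globalyzer code.
final class EmbeddedStringOrderSketch {

    /** Returns the strings still reported as issues after the four steps. */
    static List<String> scan(List<String> candidates,
                             Predicate<String> ordinaryFilters,   // line, content, statement, variable filters
                             Predicate<String> retentionPatterns, // method, variable, content patterns
                             Predicate<String> methodFilters) {   // String Method Filters
        List<String> issues = new ArrayList<>();
        for (String s : candidates) {
            boolean reported = true;                          // 1. every string starts as an issue
            if (ordinaryFilters.test(s)) reported = false;    // 2. ordinary filters remove it
            if (retentionPatterns.test(s)) reported = true;   // 3. retention rules bring it back
            if (methodFilters.test(s)) reported = false;      // 4. method filters run last and win
            if (reported) issues.add(s);
        }
        return issues;
    }

    public static void main(String[] args) {
        List<String> result = scan(
            Arrays.asList("Save", "user_id", "Hello world", "debug Save failed"),
            s -> !s.contains(" "),        // assumed filter: strings without spaces
            s -> s.contains("Save"),      // assumed retention rule
            s -> s.startsWith("debug"));  // assumed stand-in for a method filter
        System.out.println(result);
    }
}
```

In this sketch "Save" is filtered and then retained, while "debug Save failed" matches the retention rule but is still removed by the final step.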
I changed the Rule Set but the Workbench keeps using the old Rule Set. How can I use the updated Rule Set?
When rule sets are modified on the server, select Project->Reload Rule Sets to refresh the client.
We have lots of Rule Sets: Is there a way to organize Rule Set to better manage them?
Globalyzer supports Inherited Rule Sets.
Base rule sets can be created and maintained by an individual and then project level rule sets can extend the base rule set. The extended rule set would have everything from the base rule set, plus whatever is added/modified.
When an individual introduces a new rule or modifies a rule in the project level rule set, other projects wouldn't be affected.
Where can I find help on “General Pattern” issues found in C++ code scanning?
If you log in to the Globalyzer Server and look at the General Patterns for your rule set, they will often give information on why Globalyzer is scanning for each pattern.
Additionally, if you go to the Help system on the Globalyzer Server, there are various topics on C++ internationalization. In particular, click on Globalyzer Server Reference->Locale-Sensitive Methods and Properties ->C++ Programming Language->C++ Rule Sets. This help page talks about Unicode support in the various C++ rule sets. For example, a C++ program is usually compiled with single-byte character strings. A single byte cannot hold a Unicode character that requires more than 1 byte. That is the main reason why our C++ General Patterns scan for character strings: You will have to make sure to modify them if they are to hold Unicode strings.
Does Globalyzer fix JavaScript locale-sensitive method issues?
Globalyzer detects methods that could be an issue when supporting multiple languages, but has no specific fixing built in. This is because it’s not always clear that the method is an actual issue and the fix may involve some reworking that requires manual decisions. However, for some programming languages, we have written internationalization (i18n) help for the method that explains the reason for the detection as well as suggestions on what change might need to be made.
When we don’t provide specific i18n help, we provide links to external help on the method, which sometimes provide information about i18n considerations.
What is the fix for the JavaScript locale-sensitive method charAt()?
In this case, Globalyzer detected charAt because it is a method that indexes into a string. If that string contains a translation, then the location of the character may have changed or it may not be the same character. The fix is really dependent on the usage. If the string is locale-independent, then you can insert an Ignore This Line comment so that Globalyzer will no longer flag this issue.
How do I add new JavaScript locale-sensitive methods or modify the description and help for existing methods?
If you have a Globalyzer Team Server license, you can add to or modify the default Locale-Sensitive Methods for each programming language so that your users will also see your changes whenever they create a new Rule Set.
If you’re using our hosted globalyzer.com server, you can add to or modify the Locale-Sensitive Methods of a specific Rule Set that you create and then share with other Globalyzer users that are part of your team. That way, your team members will benefit from the work you have done in determining the resolution for Locale-Sensitive Method issues. This approach applies to all Rule Set rules, such as General Patterns, Static File References, and Embedded Strings.
When you create your rule set, can you specify the file extensions you would like scanned?
The default for a Java rule set is to scan files with the following extensions: java, jsp, jspf, and jspx. If you are only interested in jsp files, you can disable the others. Steps to do this:
- Log in to the Globalyzer Server and select your Java rule set
- Select Configure Source File Extensions
At this point, there are several options available:
- Uncheck the file extensions you are not interested in, but may use in the future
- Add New File Extensions, say if you are using an extension not in this list
- Modify the file extension. Select the extension and in the next panel make changes and Update.
- Delete File Extensions that you don't want. Select the extension and in the next panel select Delete.
- Add File Extension Defaults. If you have removed the defaults and want them reinstated.
How do I create a rule set for the 'C' language?
For the C language, you should choose one of our C++ variants. The main ones are
- ANSI UTF-8
- ANSI UTF-16
- Cross Platform UTF-8
- Cross Platform UTF-16
- Windows Generic
- Windows MBCS (multibyte character set)
- Windows Unicode
- If you are using GNU C, you will want to use one of the ANSI rule sets.
- UTF-8 if that’s how you want to support Unicode
- UTF-16 if you will be using wide-character calls to support UTF-16 Unicode.
- If you are just running on Windows, then you can choose a Windows variant.
- If you’ll be running on both Windows and Unix, then you’ll need a cross-platform rule set.
The difference between the variants is the list of locale-sensitive methods Globalyzer will scan for in your code.
To get a better feel for the differences, you can create a few rule sets with the different variants and look at the locale-sensitive methods defined.
Should we always use Microsoft-specific "C++ Generic" mappings for data types, routines, and other objects?
If your C++ application is running on the Windows platform, then it is a good idea to use the generic mappings because, in theory, you can switch from single-byte to wide-character support with just a flip of a compiler switch. This allows your teammates to keep running the application while you internationalize it. At some point, you then flip to a Unicode compile. Globalyzer can help you in this process. Just create a C++ Rule Set with the Windows Generic variant so that when Globalyzer scans your C++ code, it will include detection of methods that are not of the generic form.
How to define a regex pattern in a rule set
When specifying regex patterns with Unicode characters, you need to specify the characters like this: \uXXXX where XXXX is the four-digit hexadecimal number for the character.
For example, if I have this string: "中国" Then I would specify this general pattern to find it: \u4E2D\u56FD
http://www.regular-expressions.info/unicode.html
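As a quick check outside Globalyzer, the same pattern can be tried with Java's java.util.regex, which accepts the \uXXXX notation. The class and method names below are illustrative only, and the sample text is written with Java Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

final class UnicodeEscapeDemo {
    // The backslashes are doubled so that the regex engine, not the Java
    // compiler, interprets \u4E2D\u56FD.
    static final Pattern CHINA = Pattern.compile("\\u4E2D\\u56FD");

    static boolean containsChina(String text) {
        return CHINA.matcher(text).find();
    }

    public static void main(String[] args) {
        // "\u4E2D\u56FD" is the string "中国" from the example above.
        System.out.println(containsChina("Made in \u4E2D\u56FD")); // true
        System.out.println(containsChina("Made in China"));        // false
    }
}
```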
How do I create a regex pattern for Registered, Trademark, Copyright symbols?
To detect/filter characters such as ® (Registered), please use the Unicode code point in the regex. For instance,
- ® (Registered): \u00AE
- ™ (Trademark): \u2122
- © (Copyright): \u00A9
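The three code points can also be combined into one character class. The Java sketch below, with hypothetical class and method names, lists which of the symbols occur in a piece of text; it is only meant to show the pattern in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LegalSymbolDemo {
    // Registered (U+00AE), Trademark (U+2122), Copyright (U+00A9)
    static final Pattern LEGAL_SYMBOLS = Pattern.compile("[\\u00AE\\u2122\\u00A9]");

    /** Returns the code points of the matched symbols, in order of appearance. */
    static List<String> findSymbols(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = LEGAL_SYMBOLS.matcher(text);
        while (m.find()) {
            found.add(String.format("U+%04X", (int) m.group().charAt(0)));
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample text is made up for the example.
        System.out.println(findSymbols("Acme\u00AE Widget\u2122 \u00A9 Example Corp"));
        // [U+00AE, U+2122, U+00A9]
    }
}
```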
Searching for characters in a specified language
It may be useful to detect all strings that contain characters from a specific language, for instance to find all strings of Chinese characters within an application. This can be done using Unicode scripts. Here are a few examples:
Chinese: \p{script=Han}
Korean: \p{script=Hangul}
Some languages, such as Japanese, combine multiple scripts.
Japanese: [\p{script=Hiragana}\p{script=Katakana}\p{script=Han}]
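The script patterns above can be tried directly in Java 7 or later, whose regex engine understands the script= keyword. The following is a minimal sketch with made-up class and method names; the sample strings are written as Unicode escapes.

```java
import java.util.regex.Pattern;

final class ScriptDetectionDemo {
    static final Pattern HAN = Pattern.compile("\\p{script=Han}");
    static final Pattern HANGUL = Pattern.compile("\\p{script=Hangul}");
    static final Pattern JAPANESE = Pattern.compile(
        "[\\p{script=Hiragana}\\p{script=Katakana}\\p{script=Han}]");

    /** True if the text contains at least one character matched by the pattern. */
    static boolean contains(Pattern script, String text) {
        return script.matcher(text).find();
    }

    public static void main(String[] args) {
        String chinese = "\u4E2D\u56FD";                    // 中国
        String korean = "\uD55C\uAE00";                     // 한글
        String hiragana = "\u3053\u3093\u306B\u3061\u306F"; // こんにちは

        System.out.println(contains(HAN, chinese));       // true
        System.out.println(contains(HANGUL, korean));     // true
        System.out.println(contains(JAPANESE, hiragana)); // true
        System.out.println(contains(HAN, hiragana));      // false: Hiragana is not Han
    }
}
```

Because Japanese text also uses Han characters, the Japanese pattern matches Chinese text as well; script detection alone cannot tell the two languages apart.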
Searching for characters in a specified language with character ranges
Unicode scripts are not the only way to find characters within a specified language. While they are the simplest means to do this, they are not supported in all environments. For instance, Java 1.6 does not support regex searches using Unicode scripts.
Another solution is to use Unicode character ranges. For instance, CJK Unified Ideographs represent the most common Chinese characters. The basic set of CJK Unified Ideographs are all contained within the character range \u4e00-\u9fd5. To search for strings containing these characters, create a String Retention Pattern with the following regex pattern:
[\u4e00-\u9fd5]+
Additional Chinese Ideograph characters fall within the ranges of:
- \u3400-\u4db5 (CJK Unified Ideographs extension A)
- \x{20000}-\x{2a6d6} (CJK Unified Ideographs extension B)
- \x{2a700}-\x{2b734} (CJK Unified Ideographs extension C)
The \uXXXX notation takes exactly four hexadecimal digits, so the code points above FFFF in extensions B and C are written here in the \x{...} form, which Java regular expressions accept from Java 7 onward.
Multiple character ranges can be used to create a single expanded character set, like so:
[\u4e00-\u9fd5\u3400-\u4db5\x{20000}-\x{2a6d6}\x{2a700}-\x{2b734}]+
The above regex will find strings of one or more Chinese characters from any of the Ideograph sets.
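The combined character set can be checked with a short Java sketch. It assumes a Java 7 or later regex engine, where \uXXXX reads exactly four hexadecimal digits and code points above FFFF are written as \x{...}. The class and method names are invented for the example.

```java
import java.util.regex.Pattern;

final class HanRangeDemo {
    // Basic CJK Unified Ideographs plus extensions A, B and C.
    static final Pattern HAN_RANGES = Pattern.compile(
        "[\\u4e00-\\u9fd5\\u3400-\\u4db5\\x{20000}-\\x{2a6d6}\\x{2a700}-\\x{2b734}]+");

    // Pitfall: "\u20000" is read as \u2000 followed by the digit 0, so this
    // class contains the range '0'..U+2A6D and matches ordinary ASCII letters.
    static final Pattern FIVE_DIGIT_PITFALL = Pattern.compile("[\\u20000-\\u2a6d6]+");

    /** True if the whole string consists of characters from the Han ranges. */
    static boolean isAllHan(String text) {
        return HAN_RANGES.matcher(text).matches();
    }

    public static void main(String[] args) {
        String extensionB = new String(Character.toChars(0x20000));
        System.out.println(isAllHan("\u4E2D\u56FD"));                  // true
        System.out.println(isAllHan(extensionB));                      // true
        System.out.println(isAllHan("abc"));                           // false
        System.out.println(FIVE_DIGIT_PITFALL.matcher("A").matches()); // true, unintended
    }
}
```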
Character Ranges for Korean, Chinese and Japanese
Korean (Hangul)
- (Hex 1100-11ff) (Decimal 4352-4607) (Jamo)
- (Hex a960-a97c) (Decimal 43360-43388) (Jamo extended a)
- (Hex d7b0-d7ff) (Decimal 55216-55295) (Jamo extended b)
- (Hex 3130-318f) (Decimal 12592-12687) (Hangul compatibility Jamo)
- (Hex ff00-ffef) (Decimal 65280-65519) (half width and full width forms, includes English alphabet)
Chinese (Han)
- (Hex 4e00-9fd5) (Decimal 19968-40917) (CJK Unified Ideographs, ~500 page pdf)
- (Hex 3400-4db5) (Decimal 13312-19893) (CJK Unified Ideographs ext. a, ~100 page pdf)
- (Hex 20000-2a6d6) (Decimal 131072-173782) (CJK Unified Ideographs ext. b, ~400 page pdf)
- (Hex 2a700-2b734) (Decimal 173824-177972) (CJK Unified Ideographs ext. c, ~40 page pdf)
Japanese (Katakana, Hiragana, Kanji)
At a glance:
- (Hex 3000-30ff) (Decimal 12288-12543) (Punctuation, Hiragana, Katakana)
- (Hex 31f0-31ff) (Decimal 12784-12799) (Katakana phonetic extensions)
- (Hex 1b000-1b0ff) (Decimal 110592-110847) (Katakana supplement)
- (Hex 3400-9faf) (Decimal 13312-40879) (Common / Uncommon Kanji and Rare Kanji)
- Exclude 4db1-4dff if you wish to avoid a section between sets of Kanji
- (Hex ff00-ffef) (Decimal 65280-65519) (Half width and full width forms, includes English alphabet)
Full details:
- (Hex 3000-303f) (Decimal 12288-12351) (Punctuation)
- (Hex 3040-309f) (Decimal 12352-12447) (Hiragana)
- (Hex 30a0-30ff) (Decimal 12448-12543) (Katakana)
- (Hex 31f0-31ff) (Decimal 12784-12799) (Katakana phonetic extensions)
- (Hex 1b000-1b0ff) (Decimal 110592-110847) (Katakana supplement)
- (Hex 4e00-9faf) (Decimal 19968-40879) (Common and Uncommon Kanji)
- (Hex 3400-4dbf) (Decimal 13312-19903) (Rare Kanji)
- (Hex ff60-ffdf) (Decimal 65376-65503) (Half width Japanese punctuation and Katakana)
- (Hex ff00-ffef) (Decimal 65280-65519) (Half width and full width forms, includes English alphabet)
Proper String Formatting for Internationalization
Most programming languages support some kind of parameter substitution. This mechanism can be used to refactor string concatenation into proper resources. For instance, a Java string such as {0} is {1} years old. can be passed to the MessageFormat class, which replaces parameter '0' with the value of 'user' and parameter '1' with the value of 'age'. The resource in the resource bundle would then look like the following:
RES1={0} is {1} years old.
RES2=Title
RES3=Navigate
The French translator can then make use of the context and create the corresponding French .properties file:
RES1={0} a {1} ans.
RES2=Titre
RES3=Naviguer
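A minimal Java sketch of the substitution follows. To keep it self-contained, the two RES1 patterns are inlined rather than loaded from .properties files through a ResourceBundle, and the class name, method name, and sample values are assumptions made for the example.

```java
import java.text.MessageFormat;
import java.util.Locale;

final class MessageFormatDemo {
    /** Fills the numbered placeholders of a pattern using the given locale. */
    static String fill(String pattern, Locale locale, Object... args) {
        return new MessageFormat(pattern, locale).format(args);
    }

    public static void main(String[] args) {
        String user = "Ana"; // sample values, not from a real application
        int age = 30;

        // In a real application these patterns would be looked up by key (RES1).
        System.out.println(fill("{0} is {1} years old.", Locale.ENGLISH, user, age));
        // Ana is 30 years old.
        System.out.println(fill("{0} a {1} ans.", Locale.FRENCH, user, age));
        // Ana a 30 ans.
    }
}
```

The code never concatenates fragments, so the translator is free to move {0} and {1} wherever the target language needs them. Note that MessageFormat treats a single quote as an escape character, so a literal apostrophe in a translated pattern has to be doubled.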
Identifying Class or Variable Type in Java Rule Sets
Introduction
Globalyzer Server version 5 introduced new types of rules for Java rule sets based on an improved, in-depth i18n parsing.
Say that you want to filter strings passed as parameters to a method called get, which is a fairly common method name. You can specify the class name on which the rule should be applied. Let's say the I18nUtil class and the UIUtil both have a 'get' method.
- You can specify a string method filter on get for variables of type I18nUtil and
- You can specify a string method pattern on get for variables of type UIUtil.
- Static methods are also handled (I18nUtil.get("string") for example)
- You can also have an overarching get rule for all variable types by leaving the Class or Variable Type(s) field empty, as opposed to listing all the class types on which a detection or a filter must apply.
User Interface
Class names are specified as part of the rules. A new String Method Filter for the illustrative smalljava rule set is configured with the following fields:
- Name: The name of the rule. It could be something like i18n get
- Pattern: The pattern that matches the method name. It could be something as simple as get
- Class or Variable Type(s): A pattern that matches the class name. It could be something like company.project.util.I18nUtil
- Description: The description or the reason for this filter could be something like "I18nUtil get method string parameters do not need to be externalized into a resource bundle for i18n purposes"
- Help Page: The link to a more verbose help page which may indicate the context and the reason for the filter.
Note: The value of Class/Variable Type(s) is a string delimited list of fully qualified classes and types. If the field is empty, the methods are filtered/detected on all variables or classes accordingly.
Type of Rules
The first rules to be impacted are Java String Method Filters and Java String Method Pattern retention rules. When a String parameter is passed to a Java method, the class of the variable on which the method is invoked is now known. A rule on a method called "setText" can therefore differentiate between classes or objects of different types on which that method is invoked. This makes for better rules and finer results.
As Lingoport explores other possibilities, other rules and other programming languages will be covered.
Example
import company.project.util.Dbg; // A fully qualified class name
import company.project.ui.*; // Label is in the company.project.ui package.
[...]
Dbg dbg = Dbg.getInstance();
Label lbl = new Label();
[...]
dbg.setText("Create User action taken.");
[...]
lbl.setText("Menu");
This snippet of code does have strings.
- The class company.project.util.Dbg is a debug class and its setText method puts the String parameter into a database for support purposes. That string is not visible to the end user. In that instance, the setText method on a variable of type company.project.util.Dbg should be filtered. The Strings in statements like Dbg.setText("a string"); or company.project.util.Dbg.setText("another string"); would also be filtered.
- The variable lbl of class company.project.ui.Label represents a text area in the User Interface and the setText method passes a user visible string. The string Menu passed to this setText method should be flagged: It needs to be externalized out of the code into a resource bundle.
The String Method Filter rule would be configured the following way:
- Name: Debug setText
- Pattern: setText
- Class or Variable Type(s): company.project.util.Dbg
- Description: String parameters passed to the Debug setText must not be externalized and translated.
- Help Page: <blank>
The String Method Pattern retention rule would be configured the following way:
- Name: User Interface Label setText
- Pattern: setText
- Class or Variable Type(s): company.project.ui.Label
- Description: String parameters passed to the Label setText method must be externalized and translated.
- Help Page: <blank>
When the rules are applied:
- The string passed to the Dbg variable setText method is filtered out and does not show up as a candidate issue.
- The string passed to the Label variable setText method is detected and does show up as a candidate issue.
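The effect of the Class or Variable Type(s) field can be pictured with a small Java sketch. Everything in it (the Rule class, the decide method, exact name comparison instead of pattern matching) is a hypothetical simplification, not how Globalyzer is implemented. It only illustrates that the same method name leads to different outcomes depending on the type it is invoked on.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of type-aware method rules; not Globalyzer code.
final class TypedMethodRuleSketch {
    enum Kind { FILTER, RETAIN }

    static final class Rule {
        final Kind kind;
        final String method;
        final List<String> types; // empty list: the rule applies to every type

        Rule(Kind kind, String method, String... types) {
            this.kind = kind;
            this.method = method;
            this.types = Arrays.asList(types);
        }

        boolean applies(String calledMethod, String receiverType) {
            return method.equals(calledMethod)
                && (types.isEmpty() || types.contains(receiverType));
        }
    }

    static final List<Rule> RULES = Arrays.asList(
        new Rule(Kind.FILTER, "setText", "company.project.util.Dbg"),
        new Rule(Kind.RETAIN, "setText", "company.project.ui.Label"));

    /** Returns "filtered", "retained", or "unmatched" for a string argument. */
    static String decide(String calledMethod, String receiverType) {
        for (Rule r : RULES) {   // filters take precedence over retention rules
            if (r.kind == Kind.FILTER && r.applies(calledMethod, receiverType)) return "filtered";
        }
        for (Rule r : RULES) {
            if (r.kind == Kind.RETAIN && r.applies(calledMethod, receiverType)) return "retained";
        }
        return "unmatched";
    }

    public static void main(String[] args) {
        System.out.println(decide("setText", "company.project.util.Dbg"));  // filtered
        System.out.println(decide("setText", "company.project.ui.Label"));  // retained
    }
}
```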
Clean User History
When users log in to their Globalyzer Server accounts, that login history is stored on the Globalyzer Server. Likewise, when users log into the Globalyzer Client (Workbench or Lite) and execute scans, that login and scanning history is also stored on the Server.
Administrators and Managers can view and delete this data on the Server by selecting Clean User History from their home page.
Additionally, a cron job runs nightly to remove data. Our hosted server is configured to keep data for 90 days. If you install your own Enterprise Server, this value is configurable via the gzserver.cleanup.daysToKeep setting in the GzserverConfig.groovy file. If not configured, it defaults to 90 days.