Scanning Unsupported Languages

Out of the box, Globalyzer supports many languages. However, some newer or less widely used languages are not supported; a few examples include Scala, Rust, and Go.

Scanning Unsupported Languages

Globalyzer can be used to scan many languages that are not officially supported. This process starts with a rule set for a similar, supported language, which is then adapted to the new language's syntax and conventions.

Picking a Base Rule Set to Adapt

When picking an appropriate rule set to use as a base, it is important to ensure that the base language's syntax is close to the new language's. Look for the following similarities (illustrated in the sketch after this checklist):

Functions/Methods start and end in the same way:

  • Are methods/functions always followed by an opening parenthesis '('?
    • For instance, parentheses after a function call are optional in Perl.
  • Are methods used?
    • Prefixed with '.' ?

Statement terminators match:

  • Is ';' used to terminate statements?
  • Are statements terminated on every new line?

Strings are defined similarly:

  • Are single quotes used?
  • Double quotes?
  • Are there additional quoting rules?
    • E.g. in JavaScript, quotes may appear (unescaped) inside regex literals. These quotes may not be balanced, which can break string parsing under many languages' rules.
      • var regex = /quote: ', then text/;
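
If Rust were the new language, for example, a quick check against a C- or Java-style base rule set shows that it matches on most of these points: calls are always followed by parentheses, methods use '.', statements end with ';', and strings use double quotes, with single quotes reserved for characters and r"..." for raw strings. A minimal sketch:

  fn main() {
      // Function and method calls are always followed by parentheses,
      // and methods are invoked with '.':
      let greeting = String::from("hello");
      let upper = greeting.to_uppercase();

      // Statements are terminated with ';', not by the end of a line:
      println!("{}", upper);

      // Double quotes delimit strings, single quotes delimit single characters,
      // and raw strings (r"...") change the escaping rules:
      let label = "Save";
      let initial = 'S';
      let pattern = r"quote: ', then text";
      println!("{} {} {}", label, initial, pattern);
  }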

Testing

Once you've picked a base language whose rule set looks applicable, it is important to test it against your new language. The Globalyzer Workbench is ideal for this purpose. Create a Workbench project from some existing code in the new language; a codebase of around 50k lines of code provides a good balance between code variety and scanning time. Then create or use a default rule set for the base language.

Scan the project and look for errors in the scan results. Create an example detection and filtering rule for Embedded Strings (including at least one method-based filter/detection), Locale Sensitive Methods, and General Patterns. Check that these rules are being applied properly.

If the new language has syntax that the base language lacks, some errors may be unavoidable. For instance, String Operand Filters from C/C++ cannot be applied to Rust pattern matching. Look for major issues. You may also wish to test with a few different rule set types to see what works best.
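
To make the Rust example concrete, the sketch below places string literals in match arms, a construct that the C/C++ string operand rules were never written to handle:

  fn main() {
      let status = "timeout";
      // String literals appear here as match arms rather than as operands of a
      // comparison operator, so operator-based string filters written for C/C++
      // will not recognize them.
      let message = match status {
          "ok" => "Operation succeeded",
          "timeout" => "The request timed out",
          _ => "Unknown error",
      };
      println!("{}", message);
  }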

Adapting the Rule Set

Once you have found a suitable rule set, it must be adapted to work with your target language.

Different languages handle localization differently. Before updating a rule set to support a new language, it is important to get a sense of what i18n support/issues are present in your language.

For example, consider multibyte characters in UTF-8. If you wish to count the number of characters in a string, you will need to know how strings are represented in your language. Perl, for instance, has strong Unicode support: on a properly decoded string, length($string) returns the number of characters. C, on the other hand, represents strings as char arrays (char[]) by default; strlen and similar functions return the byte length of the given string, which is not always equal to the character count.
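
Rust, one of the unsupported languages mentioned above, makes the same distinction explicit: strings are UTF-8 byte buffers, len() counts bytes, and chars().count() counts characters. A minimal sketch:

  fn main() {
      let s = "héllo"; // 'é' occupies two bytes in UTF-8

      // len() counts bytes, much like strlen() on a UTF-8 char[] in C.
      assert_eq!(s.len(), 6);

      // chars().count() counts Unicode scalar values, i.e. the character count.
      assert_eq!(s.chars().count(), 5);

      println!("bytes: {}, chars: {}", s.len(), s.chars().count());
  }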

Some topics to consider (see the example after this list):
  • Possible hard coded fonts/encodings/date formats
  • User facing string methods
    • E.g. Java's JLabel
  • Collation methods
    • String.append()?
  • Locale specific Date/Time methods
  • Encoding handling and methods
  • Number formatting methods
  • String formatting methods
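
One practical way to research these topics is to write small probe programs in the target language and observe its defaults. The sketch below uses Rust as the example, probing case mapping and sort order; the same approach works for date, number, and encoding behavior:

  fn main() {
      // Case conversion follows Unicode default mappings, not locale rules:
      // Turkish dotted 'İ' lowercases to 'i' plus a combining dot (U+0307),
      // and German 'ß' uppercases to "SS".
      println!("{:?}", "İ".to_lowercase());
      println!("{:?}", "ß".to_uppercase());

      // Default ordering compares Unicode code points, not collation order,
      // so "Zebra" sorts ahead of "apple" and "Émile" sorts last.
      let mut words = vec!["apple", "Zebra", "Émile"];
      words.sort();
      println!("{:?}", words);
  }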


Once you have researched i18n issues in your language, it is time to update the rule set.

Removing Old Filters/Detections

Rule sets for every language start with default filtering and detection rules. Many of the rules that applied to the original language may not apply to the language you are adapting for. The best time to remove these is after you have a sense of your language's i18n issues, but before you have begun adding rules of your own. If you pay attention to the rules that you remove, you may also notice potential issues that you missed during your original research.

Look through the default rules and remove any rules that do not apply to your new language. If unsure about a rule, you can disable it without removing it.

Detecting Language-Specific i18n Issues

Now that you have cleaned out the rule set, it is time to add detection rules for your language.

Embedded Strings

Before adding embedded string detection patterns, it is important to understand how this category works:

  1. Every string in the application is cataloged
  2. If a dictionary scan is enabled, strings that do not contain dictionary words are ignored.
  3. Strings matched by filtering rules are ignored.
  4. Ignored strings matched by a String Retention/Detection filter are retained in the results.

You will want to add string retention patterns that match language-specific output/labeling classes, methods, and functions (e.g. Java's JLabel). However, most strings should be detected even without this change.
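
As a sketch of what such a retention pattern would target, the Rust snippet below uses a hypothetical Label::new constructor standing in for whatever UI framework a real codebase would use; a retention rule keyed on that method name keeps its string argument in the results, while ordinary internal strings are left to the normal filtering rules:

  // Hypothetical stand-in for whatever UI framework the codebase uses;
  // only the string arguments matter to the scan.
  struct Label(String);

  impl Label {
      fn new(text: &str) -> Self {
          Label(text.to_string())
      }
  }

  fn main() {
      // User-facing string: a retention rule keyed on Label::new keeps this
      // string in the results even if a filter or dictionary scan would
      // otherwise drop it.
      let heading = Label::new("Shipping address");

      // Internal string: contains no dictionary words and is matched by no
      // retention rule, so it stays out of the results.
      let state_key = "frm_st_01";

      println!("{} {}", heading.0, state_key);
  }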

General Patterns

Create general pattern rules for issues caused by a constant or a hard coded format. For instance, a general pattern might be used to look for 'ASCII' or 'mm/dd/yy'.
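
A minimal sketch of the kind of code those two patterns would flag, again using Rust as the example (the variable names are arbitrary):

  fn main() {
      // A general pattern matching "ASCII" (or "ISO-8859", etc.) flags this
      // hard coded encoding assumption.
      let encoding = "ASCII";

      // A general pattern matching "mm/dd/yy" flags this hard coded,
      // US-centric date format.
      let date_format = "mm/dd/yy";

      println!("{} {}", encoding, date_format);
  }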

Locale Sensitive Methods

Review how your language handles the following categories of behavior:

  • Collation
  • Date/Time
  • Encoding
  • Number Formatting
  • String Formatting

Then look for and add relevant methods to these locale sensitive method categories.
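
For Rust, for example, most locale-aware behavior lives in third-party crates rather than the standard library, so the standard calls themselves are the candidates worth flagging. The sketch below annotates one example per category; the specific calls in a given codebase will vary:

  use std::time::{SystemTime, UNIX_EPOCH};

  fn main() {
      // Collation: comparisons and sorts on &str use code-point order, so any
      // comparison used to order user-visible text is a candidate method.
      let ordered = "apple" < "Zebra"; // false - no locale-aware collation

      // Date/Time: the standard library only exposes raw timestamps; whatever
      // the codebase uses to format them for display belongs in this category.
      let secs = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();

      // Encoding: conversions between bytes and strings assume UTF-8.
      let text = String::from_utf8(vec![72, 105]).unwrap();

      // Number formatting: to_string()/format! never apply grouping or
      // locale-specific decimal separators.
      let amount = format!("{:.2}", 1234567.891); // "1234567.89"

      // String formatting / case mapping: Unicode default rules, not locale rules.
      let shout = "straße".to_uppercase(); // "STRASSE"

      println!("{} {} {} {} {}", ordered, secs, text, amount, shout);
  }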

Static File References

If your language references static files with a distinctive file type, you may wish to add that type here. This will give you a reference for where those files are used in the code base. Some static files may need to be localized or handled in a locale-sensitive manner.
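
In Rust, for instance, static files can be embedded at compile time with include_str!/include_bytes! or read at run time by path, so a rule covering the relevant file extensions gives you an inventory of both. A sketch with hypothetical file names:

  fn main() {
      // Compile-time reference: the contents of the included file may need to
      // be localized. (The file names here are hypothetical.)
      let help_text = include_str!("help_en.txt");

      // Run-time reference by path: another pattern worth cataloging.
      let template = std::fs::read_to_string("templates/invoice.html")
          .expect("template missing");

      println!("{} {}", help_text.len(), template.len());
  }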

Using Your New Rule Set

Once you have finished the steps above, you will be ready to scan code in your language with the new rule set. We recommend using the newly created rule set as a base rule set and having project-specific rule sets inherit from it. This properly separates language-level and project-level concerns without duplicating work.