Latest revision as of 19:41, 2 February 2022
Machine Learning Overview
Machine Learning prediction is a Globalyzer Workbench and Globalyzer Lite feature that helps users more quickly identify the real issues in their source code. We suggest applying Machine Learning as a follow-up step to scanning with Rule Sets. It helps to determine which candidate issues are indeed i18n issues.
Installation
Prerequisites: Python 3.7.x and H2O.ai 3.x
1. Download Python version 3.7.x from website https://www.python.org/downloads/
2. Install python and add python to the PATH environment variable
3. Go to this link http://h2o-release.s3.amazonaws.com/h2o/rel-zorn/2/index.html and navigate to the "INSTALL IN PYTHON" tab. The instructions there are also shown below.
Install dependencies (prepending with `sudo` if needed):
 pip install requests
 pip install tabulate
 pip install future
At the command line, copy and paste these commands one line at a time:
pip uninstall h2o # if a permission error occurs, prepend this command with `sudo`
pip install http://h2o-release.s3.amazonaws.com/h2o/rel-zorn/2/Python/h2o-3.36.0.2-py2.py3-none-any.whl
The installation succeeded if the output includes "Successfully installed h2o-3.36.0.2".
Test1: From the command line, type `python -V`. The test passes if it replies with a Python version like "Python 3.7.x". If the Python version is 2.x, then there is a problem, possibly with paths, that needs to be addressed.
Test2: On the command line, go into python. In python:
 > import h2o
 > h2o.init()
This should complete without errors.
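Both tests can also be run from a single script. The following is a minimal sketch, not part of the Globalyzer distribution: the function names are illustrative, and it only checks that the `h2o` package can be located; it does not start an H2O instance the way `h2o.init()` does.

```python
import importlib.util
import sys


def classify_python(version_info):
    """Classify an interpreter version against the Python 3.7.x prerequisite."""
    major, minor = version_info[0], version_info[1]
    if major != 3:
        return "problem"   # e.g. Python 2.x: check the PATH setting
    if minor != 7:
        return "warning"   # Python 3, but not the 3.7.x named in the prerequisites
    return "ok"


def h2o_is_installed():
    """Report whether the h2o package can be located, without importing it."""
    return importlib.util.find_spec("h2o") is not None


if __name__ == "__main__":
    print("Python", sys.version.split()[0], "->", classify_python(sys.version_info))
    print("h2o installed:", h2o_is_installed())
```

If the script reports that `h2o` is not installed, repeat the `pip install` step above and run Test2 manually to confirm that `h2o.init()` completes.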
Work Flow
To use Machine Learning, first create a Globalyzer project with scans in the Globalyzer Workbench. Notice that the scan results include a Predicted value.
The example shows that we are in the All Scans Issues Scan View. For this view there are 4,202 issues and all are denoted as "Predicted". Since no machine learning has been done yet, all the issues have a Prediction value of '-' (Unpredicted). The Predicted count is the sum of the following Prediction values:
- -
- T
- ML NULL
- Pending
- ML TRUE
Not included in the Predicted count are:
- F
- DFP
- ML FALSE
This can be seen in the Scan Views -> Manage Scan Views... popup. The Predictions in red are tallied in the Predicted count and the black ones are not.
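The counting rule above can be sketched in a few lines of Python. This is only an illustration of the rule, not Globalyzer code: the function name and the input list are hypothetical, and the values use the short spellings from the rest of this page (`P` for Pending, `ML True` for ML TRUE).

```python
from collections import Counter

# Prediction values that are tallied in the Predicted count, per the lists above.
COUNTED = {"-", "T", "P", "ML True", "ML NULL"}


def predicted_count(predictions):
    """Return how many of the given prediction values count as Predicted."""
    tally = Counter(predictions)
    return sum(count for value, count in tally.items() if value in COUNTED)


if __name__ == "__main__":
    # Before any marking, every issue is "-" and therefore counted.
    print(predicted_count(["-"] * 4202))
    # Marking issues as F removes them from the count.
    print(predicted_count(["-", "T", "F", "ML False", "P", "DFP"]))
```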
In the Scan Results view, right-click some issues that you determine are not real issues, and choose Mark prediction: FALSE (F) from the menu. Please mark the prediction of several issues as false before applying Machine Learning.
After marking the prediction of several issues as false, please select Machine Learning->GO!
Then wait for the prediction process to finish.
GO! will predict every active issue that doesn't already have a prediction value of `T`, `F`, `P`, or `DFP`. Machine Learning will assign `ML True` if it believes the issue is a true issue, `ML False` if it believes the issue is a false issue, and `ML NULL` if it can't decide.
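As a rough illustration of which issues GO! works on, the selection rule can be written as a filter. The issue records below (id, status, prediction) are a hypothetical representation chosen for this sketch; Globalyzer's internal data model is not documented here.

```python
# Prediction values that GO! leaves untouched, per the rule above.
ALREADY_DECIDED = {"T", "F", "P", "DFP"}


def issues_to_predict(issues):
    """Return the ids of active issues that GO! would assign a prediction to.

    `issues` is an iterable of (issue_id, status, prediction) tuples.
    """
    return [
        issue_id
        for issue_id, status, prediction in issues
        if status == "Active" and prediction not in ALREADY_DECIDED
    ]


if __name__ == "__main__":
    sample = [
        (1, "Active", "-"),         # never marked: predicted
        (2, "Active", "F"),         # marked by a user: skipped
        (3, "Active", "ML False"),  # earlier suggestion: predicted again
        (4, "Filtered", "-"),       # not active: skipped
    ]
    print(issues_to_predict(sample))
```

Note that issues carrying an earlier `ML True`, `ML False`, or `ML NULL` value are predicted again, which is why suggestions can change between runs.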
Here is a listing of all possible prediction values for issues:
- `-`: Issue has not been marked by a user and Machine Learning hasn't been invoked yet
- `N`: Projects that ran before 6.1 are marked as `N` if Machine Learning is not applied. It corresponds to another version of `-`
- `Negative`: Filtered issues are predicted as `Negative` and used to train Machine Learning
- `T`: Marked by a user as a real issue to train Machine Learning
- `F`: Marked by a user as a false issue to train Machine Learning
- `P`: Marked by a user as a pending issue; used to indicate that the issue has been reviewed, but it is undecided whether it is a real issue or not
- `DFP`: Issue marked as False Positive on Dashboard and can be ignored
- `ML True`: Machine Learning prediction that the issue is a true issue, i.e. the issue should be refactored
- `ML False`: Machine Learning prediction that the issue is a false issue and can be ignored
- `ML NULL`: Machine Learning cannot make a prediction, so the issue must be considered a true issue
If you find that issues predicted as `ML False` are indeed issues, please right-click the issue and select Mark prediction: TRUE (T); the next time you run GO!, Machine Learning will learn your correction. If you are not satisfied with the prediction results, please continue marking more issues as `F` or `T`, and rerun Machine Learning.
Once you are satisfied with the prediction results, the issues with a prediction value of `T`, `P`, `ML True`, or `ML NULL` are the true issues that need to be addressed. The issues with a prediction value of `F`, `ML False`, or `DFP` can be ignored. The suggested way to view the predicted active issues is to select Scan Views->All Predicted Active.
Using the All Predicted Active Scan View hides all the issues with `F`, `ML False`, and `DFP` prediction values.
After reviewing the predictions made by Machine Learning, you can either leave them as they are (`ML True`, `ML False`, `ML NULL`), or you can set the `ML True` ones explicitly to `T`, the `ML False` ones explicitly to `F`, and the `ML NULL` ones to either `T` or `F`. The difference between `ML True` and `T` is that the former is a suggestion from Machine Learning, while the latter is an explicit decision. If you leave them as suggestions, then further runs of GO! may end up with different suggestions.
Tips:
- View all issues, including filtered issues: One way to understand some of the Machine Learning results is to show all issues, including filtered ones. When an issue is predicted as `ML False`, it is easier to see why when it is surrounded by filtered issues with the same type of patterns.
- Scan->Search in Scan Results:
  - Search on the Prediction column for issues which are `ML False`. From the Search panel, you can right-click the items to change the prediction with Globalyzer->Mark Prediction: TRUE (or FALSE).
  - Search on the Prediction column for issues which are `ML NULL` and `ML True`: This will help you see which issues are predicted as true issues.
- Sorting: In the Scan Results, a few sorts can be useful:
  - Sort on Issue: Lots of similar issues should be treated the same, either `F` or `T`. By sorting on issues, you may see a pattern to use as a category for Machine Learning.
  - Sort on File: A sequence of lines may have patterns which can be used to categorize issues quickly. Use multi-selection to accelerate the `T` or `F` categorization.
  - Sort on Prediction: Instead of searching, sorting on the Prediction column can also help you see some of the potential categories that issues fall into.
Globalyzer Lite
To use Machine Learning when running Globalyzer Lite, you need to first use the Globalyzer Workbench and invoke Machine Learning as described above on the desired scans in the project. Then, when you export the project to Lite, choose which scans should use Machine Learning.
This sets flags in the generated project definition file (PDF) to use Machine Learning.
When Lite runs a PDF, it first scans the source using rules in the rule set. Then Lite applies the Prediction Report for the scan, if it exists. Finally, if the scan has been configured to use Machine Learning, Lite applies the Machine Learning files for the scan. The generated XML report will include Machine Learning prediction information, which will be read by the Dashboard. The Dashboard will only display `Active` issues with a prediction value of `-`, `T`, `P`, `ML True`, or `ML NULL`. `Active` issues with a prediction value of `F`, `ML False`, or `DFP` will not be displayed on the Dashboard.
Prediction Reports
Prediction Reports save prediction markings you made for a scan. When you select Machine Learning->GO!, Globalyzer automatically generates both the Prediction Report and Machine Learning files for the scan. If you prefer, you can use the Prediction Report without Machine Learning, allowing you to save and share your prediction markings without Machine Learning making predictions. Select Machine Learning->Export Prediction Report to export the Prediction Report for a scan; select Machine Learning->Import Prediction Report to update your scan results with the saved markings.
Prediction Reports are stored in the `<project>/lingoport` directory. Lite will automatically use the Prediction Report, if it exists, when scanning. The Dashboard knows to ignore issues with `F` and `ML False` (if using Machine Learning) prediction values. In this way, the Workbench and Dashboard will see the same active results.
Continuous Integration System
When using Machine Learning in our Continuous Integration System, make sure everything in the project/lingoport directory is pushed to your repository. This ensures that everything downstream will work as intended.
Machine Learning FAQ
1. If I change the status of an issue, will Machine Learning work?
Yes, it will work. When you change the status of an issue, the prediction of the issue will be set by default. For example, if you move an issue to `ToDo`, the prediction will be marked as `True`; if you move an issue to `Ignore`/`Invalid`, the prediction will be marked as `False`. You may still manually mark the prediction of any issue, overriding the default. Then, when Machine Learning is invoked, it will use the `True` and `False` values as input to make predictions on the remaining `Active` issues.
Please note that as of our 6.1.1 release, we are deprecating the `Invalid`, `Ignore`, and `ToDo` statuses. Please mark the prediction of an issue rather than change its status.
2. In Scan Results, what is the meaning of the Predicted count?
With the introduction of Machine Learning, Scan Results has a new count display, called Predicted. This is the number of `Active` issues with a prediction value of `-`, `T`, `P`, `ML True`, or `ML NULL`, which is also the number of `Active` issues that will appear on the Dashboard. Note that `Active` issues with a prediction value of `F`, `ML False`, or `DFP` will not be displayed on the Dashboard.
3. How does Machine Learning work?
We use h2o.ai to analyze the issue, the issue code line, and the issue reason. Based on filtered issues and the issues you marked as false, Machine Learning will try to find similar `Active` issues and set their prediction to `ML False`. Machine Learning predictions may differ per invocation; you won't get the exact same results every time. In addition, Machine Learning needs input to learn from, so if you only mark one issue as `False`, Machine Learning may not be able to find other similar issues.
4. What kind of Machine Learning algorithm does Globalyzer use?
Globalyzer uses the Gradient Boosting Machine (GBM) algorithm. Gradient Boosting Machine (for Regression and Classification) is a forward learning ensemble method. The guiding heuristic is that good predictive results can be obtained through increasingly refined approximations. H2O’s GBM sequentially builds regression trees on all the features of the dataset in a fully distributed way - each tree is built in parallel.
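To make the idea of "increasingly refined approximations" concrete, here is a minimal pure-Python sketch of gradient boosting for regression with squared-error loss, using single-split trees (stumps) on one feature. It is not H2O's implementation and not the code Globalyzer runs; the function names, parameters, and toy data are invented for this illustration only.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must put points on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all x values equal: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return (xs[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_gbm(xs, ys, n_trees=20, learn_rate=0.5):
    """Build stumps one after another, each fitted to the current residuals."""
    base = sum(ys) / len(ys)          # initial approximation: the mean
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For squared-error loss, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learn_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learn_rate, stumps


def predict_gbm(model, x):
    base, learn_rate, stumps = model
    return base + learn_rate * sum(predict_stump(s, x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((predict_gbm(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5, 6]
    ys = [1, 1, 1, 5, 5, 5]
    for n in (1, 5, 20):
        print(n, "trees, training error:", mean_squared_error(fit_gbm(xs, ys, n_trees=n), xs, ys))
```

Each added stump corrects part of what the previous ones got wrong, so the training error shrinks as trees are added. A production GBM such as H2O's uses deeper trees, many features, and distributed computation.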
5. What is H2O.ai? Do I have to install it?
H2O.ai is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data and provides easy productionalization of those models in an enterprise environment. And yes, to use Machine Learning, you must install H2O.ai on your system. It's an in-memory platform, so you don't need to worry about the security of your code and data.