Summary

The Complete Instrumentation and Analysis Tool (CIAT), or IAT, is used to support the development of safe and secure software and hardware. Using IAT allows a program to show traceability of software and VHDL down to the source code level and to automate many of the tedious, manually intensive tasks associated with code analysis and testing. To support the test phase it instruments the software source code, performs data reduction and analysis of data captured during lab testing on the target, and supports regression test analysis. During the design and coding phase it provides a number of automated source code checks to support analysts during code walk throughs and code inspections. At all phases it shows where req's are met in the code and it identifies discrepancies between the project SRDB and the actual as-built software. It accomplishes these services by reading in the actual software source code and the req's from the SRDB. Its strength is that it was developed while supporting the needs of a real program: services were added to its Internet framework and tested as those needs arose. It is based on web technology and provides HTML reports with hyperlinks that allow analysts to jump between actual lines of code, test instrumentation points, requirements, and code analysis findings. This evolution has taken the IAT to the point where its services can be used for reverse engineering and reuse analysis, where users are able to examine processing threads from previous test runs which represent typical use cases of the software. It also includes very specialized web enabled search, filter, and compare services to support on-the-fly ad hoc analysis as security related questions surface.
Introduction

The IAT is an Internet based automation aid that integrates front end system engineering to back end design, implementation, and test. This integration is not approximate or based on analysis, but exact and based on the real requirements text and source code products. This exact interface between requirements, source code, and test results provides many benefits to a project, including unambiguous traceability to the source code level, analysis reports of the source code for potential problems as part of walk throughs, and test results that allow users to literally select hypertext links to software source and requirements from actual test data. As the IAT has evolved, its services have expanded to support regression analysis, responses to customer questions, and re-use analysis. These services have been enabled with the introduction of specialized searching, filtering, comparing, and replacing features.

The Instrumentation and Analysis Tool is used to support the instrumentation of software source code, perform data reduction and analysis of data captured during lab testing on the target, and support regression test analysis. The tool is a framework that includes: Instrumentation, Simulation, Data Reduction, Lab Reduction, and Regression. The tool also supports a number of automated source code checks to support analysts during code walk throughs and code inspections. The automated analysis comes in the form of HTML reports and messages which are labeled as attentions, warnings, cautions, or notes. When a web server such as Apache is enabled, the IAT also includes general services such as searching, filtering, comparing, and replacing. These services evolved as the needs arose during the initial use of the IAT (IAT-apr-2002, IAT-sep-2002). The needs included answering customer questions, regression analysis, and reverse engineering.

There are some PowerPoint presentations which are a good introduction to the tool: Overview, Capabilities, Technical Insights, Walkthroughs, Automated SV Approach, Change History, Example Process, Business Startup, Training Outline, Lunch Presentation, White Paper, Brochure, and Customer Demo.

IAT is in the process of expanding to become a best practice. The IAT was created on Project-O and is being transferred to other programs. IAT tailoring and metrics are available.

Background Information Includes
Defensive Coding Summary.doc
The primary purpose of the reports is to provide information. They are not to be used to directly create action items or issues. They are a tool used by evaluators together with other data to generate action items and issues. The stats-problems report is the only report that contains automatic detectors for finding potential problems. The remaining reports use keyword searches and correlation to provide a view of the software and requirements.
All reports are structured in a similar fashion. There is an introduction area with links to lower portions of the report and links to the analysis, source, result, and other broad areas outside the current report. Following the introduction there is a parameters area which contains most of the settings used to generate the current report. This parameters area can be moved to the end of the report; some of the newer reports have the parameters at the end as an experiment. If a web server is enabled, these parameters can be changed and a new report can be generated. The next area is a summary area which summarizes all the information found in the details area.
The details area contains cross links to the source code and to other reports. The Source link opens the original source file. If the browser is set up to open that file extension in an editor, the file can be opened, changed, and re-saved, allowing a software developer to quickly make changes resulting from the analysis. The Result link opens the instrumented source file, the file that is modified by the IAT. The link within the log event resembling a PUI opens the data reduction report and takes the user to the referenced requirement number. The leftmost link is the IAT internal file number and line number of the current file in the detail view. Selecting any of these links will automatically open the IAT listing report and take the user to the appropriate file and line number.
The detectors and reports are in a constant state of evolution. This evolution is based on the fact that the analysts constantly need to respond to new questions as they arise on a program and the IAT is a tool to help them answer those questions. For example, if it is determined that a type recast has caused a problem, the analyst has 3 choices: (1) do nothing, (2) print out the hardcopy and look for type recasts in all the source code using a yellow highlighter, or (3) spend 15 minutes and add a new IAT service to look for type recasts. Many folks do not understand this simple concept and they resist adding to the IAT because it is viewed as adding more work to the software effort, which by the way has nothing to do with getting basic functionality out the door. However, from a program perspective there is only one right answer, anything else would leave the program open to risk. One could even argue that the DR should not be closed until all type recasts are checked.
The most current view of IAT services is located at tailoring-metrics.html. The following sections describe some of these reports, but they do not represent the complete set available for the IAT.
The possible problems are generated by the logevents.pl program and appear as a separate logical display in the reports. The possible problems summarize all the anomalies encountered while analyzing the source and generating the messages in the details portion of the reports. The possible problems are:
Possible Problem | Description
Fatal Errors | These are artifacts of instrumentation that will lead to compile errors most of the time, such as printf's, breaking if-else sequences, splitting declaration areas, and instrumenting previously instrumented code.
Header | The .c and .h headers are separately checked for the appropriate fields. Missing fields or missing headers are detected.
Classification Errors | The header classification field must be marked as defined in the rules; no other markings are permitted. The file naming convention is checked against the header classification marking. Keywords that may trigger a higher classification are compared against the file naming convention. Missing headers against classified file names are detected.
SV Marking | The SV field is checked for a YES or NO marking. Blank or other markings are detected and noted in the detailed message. The SV header field must be present for this check to detect an error; if the SV header field is missing, that is noted in the bad header area.
CV Marking | The CV field is checked for a YES or NO marking. Blank or other markings are detected and noted in the detailed message. The CV header field must be present for this check to detect an error; if the CV header field is missing, that is noted in the bad header area.
Log Events | Log events are subjected to several checks to determine if there may be a potential problem. The checks include redundant text, consecutive events without program code in between, encoding problems (LE SV not LE, not SV, not TE, etc.), potentially placing a log event at the start of a procedure rather than at the conclusion, and potentially placing a log event within a loop.
Fixed Keywords | Although there is a separate keywords report, there are several keywords that must be explained or removed. The bad fixed keywords form this set.
Switch Default Balance | All switch statements must be paired with a default statement. Missing default statements are detected exclusive of comments. There is also a reverse check to determine if there is a default with a missing switch; this last case should never occur unless something is wrong during compile or within the regular expression searches in this tool (a detector sketch follows this table).
Default SwError Balance | All default statements must call SwError. Missing SwError calls and calls that are commented out are detected.
Case Break Balance | All case statements should be paired with a break statement. Missing break statements are detected exclusive of comments. There is also a reverse check to determine if there is a break with a missing case. Without a break the code is less robust and prone to disintegration when modified. Break can also be used to exit elsewhere, which may skew this result.
Nested Switches | Nested switch sequences.
Stacked Cases | Stacked case statements with no code in between.
Calling Rules | Unique to each program.
Calling Load | Functions considered are project unique. If a function is called only once, it may be a candidate for folding into another function.
No Error Exits | Most of the software should include an error exit. This is not a hard and fast rule, but exceptions may need to be justified.
Files with 15 or more Functions | Good practices limit the number of functions in any given source file.
Files with 500 or more MOL | Good practices limit the number of LOC in any given source file.
Functions with 100 or more LOC | For proper analysis there must be a valid header termination before the start of a function. Good practices limit the LOC in any given function.
C Functions with less than 5 LOC | For proper analysis there must be a valid header termination before the start of a function. Good practices limit the LOC in any given function.
Uncalled Functions | During a peer review this shows the external interface. For a release, it shows uncalled functions.
Dead Code | This is a very simple detector looking for single line commented out code. Code that is block commented out is not detected. Dead code that is the result of logical errors is not detected.
Line Length | Good practices limit the number of characters in a line.
do Loops | Considered a bad practice that leads to maintenance problems.
goto Statements | Considered a bad practice that leads to maintenance problems.
? : Operator | Considered a bad practice. It is not clear to most software developers.
++ / -- Within Decision | Considered a bad practice by some domains.
Extra Lines between header & function | The heart of this tool is the ability to detect functions. Given the level of reuse, the ability to consistently detect functions was accomplished with this internal rule. Although not required for software review, it is a tool for the analysts if a problem should arise in the report of a software entity.
NO Lines between header & function | The heart of this tool is the ability to detect functions. Given the level of reuse, the ability to consistently detect functions was accomplished with this internal rule. Although not required for software review, it is a tool for the analysts if a problem should arise in the report of a software entity.
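As an illustration of how simple these detectors can be, the sketch below implements a Switch Default Balance style check as a standalone Perl filter. This is a minimal sketch, not the actual logevents.pl code; the comment stripping and regular expressions are simplified assumptions.

#!/usr/bin/perl
# switch-default-check.pl -- minimal sketch of a switch/default balance detector.
# Assumption: comments are stripped naively before counting; the real tool
# performs its checks exclusive of comments with its own patterns.
use strict;
use warnings;

my $file = shift or die "usage: perl switch-default-check.pl file.c\n";
open my $fh, '<', $file or die "cannot open $file: $!\n";
my $src = do { local $/; <$fh> };   # slurp the whole file
close $fh;

$src =~ s{/\*.*?\*/}{}gs;           # remove block comments
$src =~ s{//[^\n]*}{}g;             # remove line comments

my $switches = () = $src =~ /\bswitch\s*\(/g;
my $defaults = () = $src =~ /\bdefault\s*:/g;

print "$file: $switches switch, $defaults default\n";
print "possible problem: switch without default\n" if $switches > $defaults;
print "possible problem: default without switch\n" if $defaults > $switches;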
The messages are generated by the logevents.pl program and appear in the details portion of the reports. They primarily verify that the log event comments are placed properly into the source code, and they perform a limited subset of source code checks where practical within the framework of the instrumentation code. The messages are constantly changing: some are coming, some are going, and some are re-phrased.
Attentions

Warnings

Cautions
Reports are created by a user based on various display filter settings and various parameters. A limited number of predefined reports have been created and saved in the z-series scripts. Logevents.pl generates 5 reports, and simulate.pl and datareduction.pl each generate 1 report. The reports are as follows:
Report | Primary Use
Codegen.html, Keywords.html, Hamming-Distance.html, Stats-Problems.html, error-handling.html, cpc-listing.html, Abstracts.html, Comments.html | Used by analysts to support SV code walk throughs and SV code inspections.
Simulate.html, Datareduction.html | Used by the test team to verify that the TOC has been fully instrumented into the source code. Also used by the test team to perform data reduction and analysis of lab test data.
The individual reports are: Codegen.html, Keywords.html, Hamming-Distance.html, Stats-Problems.html, Abstracts.html, Comments.html, Cpc-listing.html, error-handling.html, watchdog_timmer.html, Simulate.html, and Datareduction.html.
The instrumentation and analysis of the software is performed by 3 scripts: logevents, simulate, and data reduction. The tool processes software that has been commented as described in le-coding-rules.doc. The tool consists of five programs which can be started via either a web browser or a DOS script. The tool processes software stored in a source directory and places newly instrumented source code in a result directory. It also generates reports which are placed in an analysis directory. All directories and file names are user selectable.
The logevents.pl program is used to instrument and analyze the source code. If a web server is enabled, various settings can be selected and executed, and the resulting reports can then be saved and used as templates for future analysis. If a web server is not enabled, a script can be used to call the logevents.pl program and pass different parameters with each pass of a logevents.pl execution. Sample scripts are provided; they begin with <z-> in the main directory. Currently logevents.pl is executed multiple times by a single z- script to produce the following fixed reports:
When codegen.html is produced the same option causes logevents.pl to save a translation file which will be used in data reduction. The translation file is stored in the analysis directory with the reports as translate.dat.
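To make the mechanics concrete, here is a hypothetical miniature of that instrumentation pass. It finds LE.SV comments, assigns instrumentation codes starting at the $svlognum counter, inserts a log call, and saves the number-to-text mapping for translate.dat. The LogEvent() call name, the comment form, and the translate.dat record format are assumptions for illustration only.

#!/usr/bin/perl
# instrument-sketch.pl -- hypothetical miniature of the logevents.pl pass.
use strict;
use warnings;

my $svlognum = 1000;            # instrumentation initial counter
my @translate;                  # number-to-comment map for translate.dat

while (my $line = <>) {
    print $line;                # echo the original source line
    if ($line =~ m{//\s*LE\.SV\s*:?\s*(.+)$}) {   # assumed LE.SV comment form
        my $comment = $1;
        push @translate, "$svlognum|$comment";
        # insert the instrumented call right after the comment line
        print "    LogEvent($svlognum);  /* IAT instrumented */\n";
        $svlognum++;
    }
}

open my $out, '>', 'translate.dat' or die "translate.dat: $!\n";
print {$out} "$_\n" for @translate;
close $out;

Usage, under those assumptions: perl instrument-sketch.pl source/foo.c > result/foo.c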
Once the source code is instrumented two additional programs are available to support the framework, simulate.pl and datareduction.pl.
Simulate.pl accesses the instrumented source code and plays the log events. A file is created that contains only the numerical log event numbers and is stored in the analysis directory as sim.dat. A report simulate.html is also returned which shows the modules, functions, log event numbers assigned by logevents.pl, and log comments. This program is started in the same way as described for logevents.pl.
Datareduction.pl accesses the sim.dat file created by the simulator or the lab test environment, the translate.dat file and the toc.dat and bfw.dat files. It generates a data reduction report datareduction.html which shows the log event numbers, log event comments, TOC program unique identifiers (PUIs), the TOC requirements satisfied by the test run, and the TOC requirements not satisfied by the test run. This program is started in the same way as described for logevents.pl.
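A hypothetical sketch of the data reduction core follows. It assumes translate.dat records of the form "number|comment" (matching the instrumentation sketch above) and toc.dat records of the form "PUI|requirement text"; the real IAT file formats, and the PUI pattern inside log comments, may differ.

#!/usr/bin/perl
# reduce-sketch.pl -- hypothetical core of the data reduction step.
use strict;
use warnings;

my %text = map { chomp; split /\|/, $_, 2 } read_lines('translate.dat');
my %toc  = map { chomp; split /\|/, $_, 2 } read_lines('toc.dat');

my %hit;
for my $event (read_lines('sim.dat')) {
    chomp $event;
    my $comment = $text{$event} // '** not in translate.dat **';
    print "$event  $comment\n";
    $hit{$1} = 1 if $comment =~ /\b(\w+-\d+)\b/;   # assumed PUI pattern
}

print "satisfied:     $_  $toc{$_}\n" for grep {  $hit{$_} } sort keys %toc;
print "NOT satisfied: $_  $toc{$_}\n" for grep { !$hit{$_} } sort keys %toc;

sub read_lines {
    my ($file) = @_;
    open my $fh, '<', $file or die "cannot open $file: $!\n";
    return <$fh>;
}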
A single script can be created to run the reports associated with the logevents.pl, simulate.pl, and datareduction.pl programs. These scripts duplicate what would happen if a web server were enabled and users could select different parameters to produce different reports. However, in the case of the scripts, the parameters which are passed in are fixed. Examples of these scripts start with <z->. These scripts process the individual CA folders and all the CA folders:
z-all-sw.pl z-sim.pl z-rdn.pl z-key.pl
Since the individual script approach above is a maintenance problem, a master script has been created that takes in 2 or optionally 3 parameters and will process the software. Examples of using this master script are as follows:
perl z-any.pl wt-csciA wt-release csciA
perl z-any.pl wt-csciB wt-release csciB
perl z-any.pl pr-csci pr-release_1 csci
perl z-any.pl pr-csci pr-release_2 csci
where the 1st parameter is the main directory where the software resides (CSCI), the 2nd parameter is the location of the software group under this analysis run (release), and the third parameter is the CPC or CSCI name. If the third parameter is not provided, the CPC is calculated by using the second parameter and stripping the numbers and dashes from the name.
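A minimal sketch of that default CPC calculation, assuming "stripping the numbers and dashes" means deleting digits and dashes from the release name:

#!/usr/bin/perl
# cpc-default.pl -- sketch of deriving the CPC when the 3rd parameter is omitted.
use strict;
use warnings;

my ($dir, $release, $cpc) = @ARGV;
die "usage: perl cpc-default.pl dir release [cpc]\n" unless defined $release;

unless (defined $cpc) {
    ($cpc = $release) =~ s/[0-9-]+//g;   # strip numbers and dashes (assumed rule)
    # e.g. "wt-release" -> "wtrelease"
}
print "processing $dir/$release as CPC '$cpc'\n";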
Analysis is about the ability to ask questions and slice through different perspectives of data. Doing that quickly, or on the fly, is one of the key elements of creating a successful analysis platform. The web server is one of the fastest methods for allowing a user to ask those questions and get different perspectives of the data. In the absence of a web server the scripts can be used, but the focus changes to establishing a definite set of templates. An initial set of templates was established while supporting 3 different code walk throughs. These templates are the result of hundreds of analysis runs and parameter settings. As more software is subjected to code walk throughs, more scripts (templates) will be developed.
It's not as bad as you may think. Let's get started...

To get started without a web server, run this example
To install software that is to be instrumented
To get started with a web server
Parameters and Settings
The parameters to control the various options are selectable from the browser user interface if the web server is enabled. In the absence of a web server, the parameters need to be modified either in the individual programs or the script(s) which call the programs. The parameter settings form a template. The parameters are as follows:
Parameters | Comments
# tagging $classmarking = "Doc Classification"; $classhigh = "Highest Classification"; $classlow = "UNCLASSIFIED"; |
The tagging parameters are used to mark the reports. These are static settings and should not change. |
# build settings $baseline = "IAT previous release"; $future = "1\\.0 |1\\.5 |2\\.0 "; |
The build settings are unique to each build.
The $baseline parameter is used to identify the previous baseline metrics to compare with this baseline metrics set. This is obviously user controlled, but for official releases, the comparisons should be against an official previous release. It should be changed to the previous release with each new release. Since the IAT reads the DOORS requirements and the requirements include a phase attribute, its possible for the IAT to determine and sort through future requirements. The $future parameter should be changed with each official release and should only contain the set that represents future requirements. |
# directory settings | These parameters will change often and are based on the software being processed. |
$dirpath = "p:/project/iat-inst"; | Path to the location of the programs. |
$srcpath = "p:/project/iat-inst"; | Path to the source files, instrumented files, and the toc.dat file. The path is concatenated with all the location settings. |
$source = "source/xmirror-rp"; | Location of the software source files to be analyzed and instrumented. |
$result = "result"; | Location of the instrumented software source files. |
$analysis = "analysis/all-sw"; | Location of the analysis reports and the translate.dat file |
$simdata = "analysis/all-sw"; | Location of the simulation sim.dat file or lab test results. |
$reduceddata = "analysis/all-sw"; | Location of the data reduction reports |
$report = "s_znew-report.html"; | Name of the report. This name changes constantly depending on whether logevents, simulate, or datareduction is executed. |
# file extensions to analyze $extensions = "\\.c\$|\\.h\$"; |
File extensions that are accessed by the tool. |
# color settings $color1 = "blue"; $color2 = "blue"; $color3 = "red"; $color4 = "green"; $color5 = "purple"; $color6 = "red"; $color7 = "orange"; |
The keyword searches and instrumentation elements are color coded. |
# keyword searches $pdlevents_1 = "nnoonnee"; $pdlevents_2 = "nnoonnee"; $pdlevents_3 = "nnoonnee"; $pdlevents_4 = "nnoonnee"; $pdlevents_5 = "nnoonnee"; |
The color coded keyword searches available to a user. |
# hamming patterns $hamevents = "0x\\w\\w\\w\\w|return.*0x...."; @hamvalues = ('0001','0002'...) |
The pattern used to extract hex data that might correspond to values that should be a certain hamming distance apart. The extracted patterns are checked against the accepted hamvalues (a sketch follows this table). |
# instrumentation initial counter $svlognum = 1000; $dblognum = 5000; $locevent = 7000; $hmievent = 8000; |
The instrumentation codes are automatically assigned by the tool. The starting numbers are set by these parameters. |
# instrumentation patterns $svevents = "LE.SV"; $dbevents = "LE.Debug"; |
The instrumentation patterns are defined by these parameters. |
$headerformat_c $headerformat_h |
Defines the header format that should be used by the software source. |
# filters $abstract = "0"; $comments = "0"; $srccode = "0"; $header = "0"; $svreq = "1"; $cvreq = "1"; $reqs = "0"; $cfunc = "1"; $showonlysvcv = "0"; $instrument = "0"; |
Enables and disables various display filters that are presented in the details report. A "1" means the feature is enabled and included in the details report. Note 1: report in this context refers to a display area report as opposed to a dynamically generated web page which is then saved by the analyst. Note 2: these filters enable and disable functionality, meaning that if a function is disabled, such as instrument, the function will not be performed. |
# reports $rptkeywords = "0"; $rpthamvalues = "0"; $rptstats = "0"; $rptproblems = "0"; $rptdetails = "0"; $rptcompare = "0"; |
Enables and disables various reports that are displayed when the dynamically generated web page is created. The details report tends to be very large and contains all the attention, warning, caution, note, and keyword messages. The compare report is the DIFF between the original source code and the instrumented source code. A "1" means generate the unique display report. |
$window = 0; | Many times an analyst is able to print out an event based on a keyword entry, but occasionally the analyst wants context. This parameter allows an analyst to show "n" events after the trigger event. |
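The hamming distance check is worth a short illustration. Given hex codes extracted with a $hamevents style pattern, the question is how many bits separate each code from the accepted @hamvalues. The sketch below assumes a minimum-bit-distance rule; the exact rule in the real report may differ.

#!/usr/bin/perl
# hamming-sketch.pl -- sketch of the hamming distance check on extracted hex codes.
use strict;
use warnings;

my @hamvalues = ('0001', '0002', '8004');   # accepted codes (sample values)
my $mindist   = 4;                          # assumed minimum bit distance

while (my $line = <>) {
    while ($line =~ /0x(\w{4})/g) {         # $hamevents-style extraction
        my $code = hex $1;
        for my $ok (@hamvalues) {
            my $dist = bit_count($code ^ hex $ok);
            print "0x$1 is only $dist bits from 0x$ok\n"
                if $dist && $dist < $mindist;
        }
    }
}

sub bit_count {                             # count the 1 bits in a word
    my ($v) = @_;
    my $n = 0;
    while ($v) { $n += $v & 1; $v >>= 1; }
    return $n;
}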
The scripts have a set of variables that are set at the top, then a set of variables that are re-set prior to calling the main program. The programs also have the same set of variables. When the scripts are used, the variable settings in the main programs are ignored, so most of your changes should be isolated to the particular area within a script that is performing a particular analysis run. Each script has these same areas and just points to different software and analysis directories:
set some parameters
require "datareduction.pl";

####################
# unique processing
####################

####################
# SV CV instrumentation
####################
set some parameters
&logevents_main;

####################
# keywords all
####################
set some parameters
&logevents_main;

####################
# hamming distances
####################
set some parameters
&logevents_main;

####################
# abstracts
####################
set some parameters
&logevents_main;

####################
# comments
####################
set some parameters
&logevents_main;

####################
# stats & problems
####################
set some parameters
&logevents_main;

####################
# simulate
####################
require "simulate.pl";
set some parameters
&simulate_main;

####################
# datareduction
####################
require "datareduction.pl";
set some parameters
&datareduction_main;
Testing and Lab Reduction includes several scripts: labreduction.pl, design.pl, clear-test-data.pl, and compare.pl.
Labreduction.pl is a companion web based tool to datareduction.pl except that it operates on real test data rather than simulated test data. The data reduction is performed in real time as the test executes. Anyone on the network can view the test results in real time during the testing.
Design.pl is similar to labreduction.pl except it only shows the software files and functions in the actual calling sequence for a given test thread. This provides an insight into the software which can be used for troubleshooting or reverse engineering.
Clear-test-data.pl is used to clear existing test data files of raw test data and leave the user comments behind for use on future tests. This allows for the creation of templates that can be used to support multiple releases.
The test results analysis begins during informal testing by examining the real time test results.
If the results appear to be reasonable, copy and paste them into the test procedures. They will form the factory template that will be used during formal testing.
As part of an offline analysis, use the IAT reports and software source to locate the test result log events. Examine the software and look for potential missing log events or wrong log events. This is a final examination to confirm that the test results are as expected.
For the dry run, create a single complete test results report following the sequence that is expected during formal test. Use these dry run test results to support the formal test.
At the conclusion of all the DRY RUN tests, enable the log events found, log events not found, requirements correlation, and requirements coverage reports. Look for log events and requirements that were not captured during all the test activity. If needed, create new tests to capture the missing requirements.
With each test battery, there is usually more than one instance of a test thread. Examples include log on from different starting points in the system. Use the IAT to compare these different log on approaches. Open up 2 browsers side by side and scan the results. Look for discrepancies as previously defined.
Eventually a repository of test data will develop. IAT includes a very powerful compare service. This compare service can be used to identify differences between test runs and allow the analyst to just focus on the differences. Save these compare results and use them as templates for future test run comparisons. As always the IAT maintains context with all of its saved web page reports so that the analysis can be re-run or modified for a different analysis, such as new baselines. If the compare report confuses you, open up the original reports and do a side by side comparison until you can gain confidence and understanding to work with the compare report.
Finally, good test data analysis needs to include multiple dimensions. Try to compare similar things within a test thread, battery, or across baselines. Don't be afraid to keep digging when an anomaly surfaces. IAT will produce and organize a great deal of data which will let you dig until you have found an answer or you have located a genuine problem.
Besides the compare service, search and filter are also useful services for test analysis. The filter service allows an analyst to enter the PUIs from one test battery and subtract the PUIs from another test battery yielding the requirements not covered in the starting test battery. This is extremely powerful and allows the analyst to instantly determine test coverage. Search allows the analyst to search anything on the network from software and requirements to test reports and metrics reports.
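A hypothetical sketch of that PUI subtraction, treating each test battery's PUIs as a set (the one-PUI-per-line file format is an assumption):

#!/usr/bin/perl
# pui-subtract.pl -- sketch of the filter service's PUI subtraction.
# usage: perl pui-subtract.pl battery-old.txt battery-new.txt
use strict;
use warnings;

my ($old_file, $new_file) = @ARGV;
my %covered = map { chomp; $_ => 1 } lines($new_file);

for my $pui (lines($old_file)) {
    chomp $pui;
    print "$pui\n" unless $covered{$pui};   # in the old battery, not in the new
}

sub lines {
    my ($file) = @_;
    open my $fh, '<', $file or die "cannot open $file: $!\n";
    return <$fh>;
}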
This tool is started as a DOS application in a directory which contains the test data. The syntax is:
where all [fields] are optional and [blank] or [-] is the default, or are set as follows:
Raw Test Data
The file containing the test results is automatically updated by the BOOT application for some test cases. In all other BOOT and RED test cases the file is manually updated by the testers by either copying log events from the emulator window or entering log events based on emulator profile points. Using appropriate file naming conventions, parallel testing can be performed.
This application requires the inclusion of a special file (translate.dat) that is created by the logevent application and stored with the instrumentation analysis results. This file translates the log event codes to the original English equivalents. It also requires a version of the TOC (s_toc.dat) to correlate the TOC requirements with the lab test results.
Analysis Reports | Script Option | Comment |
Lab Data | LabData | This report shows the raw numerical log events captured during lab testing or during a simulation run of a software collection in analysis. |
Test Results | Results | These are the test results in log event format. This analysis connects the raw numerical log events captured during lab testing with the original source code comment statements, source file, and C-function. |
Log Events Found | LeFind | This analysis lists the log events found in these test results. |
Log Events NOT Found | LeNoFind | This analysis lists the log events NOT found in these test results. |
PUIs | TocCor TocCov | This analysis extracts the TOC numbers from the log events captured during lab testing. |
TOC Text Correlation | TocCor | This analysis correlates the log events captured during lab testing with the baseline TOC. If multiple events are associated with the same req, that req is displayed multiple times. The sequence matches the original test events sequence. |
Not in TOC Text | TocCor | This analysis shows the log events that are NOT in the current TOC baseline. For example, if a TOC req is deleted after the software is coded or if the implementor enters the wrong TOC number in the comment, this report will show those events. |
In This CPC Collection | TocCov | The TOC reqs are allocated to CPCs. This analysis shows the TOC reqs that are in this CPC collection. All the reqs in this report area should contain the CPC or CPCs that represent this analysis run. If there are only foreign CPCs, then this software collection has implemented reqs that are implemented elsewhere, or the DOORS data base is in error. If this analysis represents all the software then this report is for the current software baseline. |
Not In This CPC Collection | TocCov | This analysis shows the TOC reqs that should be in the analyzed software but are NOT in this CPC collection. It may be that only a portion of a CPC was analyzed and so there are outstanding reqs. It may be that reqs are not properly allocated in DOORS. It may be that the implementation missed reqs and the software needs to be updated. This report only applies when CPCs are analyzed. |
Not in Software Baseline | None | This analysis shows the TOC requirements that are NOT in the current analysis baseline. When the entire software set is analyzed, this report area should only contain paragraph headings, information text, and NON software reqs. |
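These coverage analyses reduce to set partitions between the TOC and the CPC allocation. A minimal sketch of the TocCov style partition, assuming toc.dat records of the form "PUI|CPC|requirement text" with one CPC per record (an illustrative format, not the real one):

#!/usr/bin/perl
# toc-coverage-sketch.pl -- sketch of the TocCov style partition.
use strict;
use warnings;

my $this_cpc = shift or die "usage: perl toc-coverage-sketch.pl CPC\n";

open my $toc, '<', 'toc.dat' or die "toc.dat: $!\n";
while (<$toc>) {
    chomp;
    my ($pui, $cpc, $text) = split /\|/, $_, 3;
    $cpc //= '';
    if ($cpc eq $this_cpc) {
        print "in this CPC collection:     $pui  $text\n";
    } else {
        print "NOT in this CPC collection: $pui  (allocated to $cpc)\n";
    }
}
close $toc;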
Regression is a web based analysis tool used to help identify potential areas of re-test when software is upgraded. It produces a bi-directional DIFF report between the two versions and submits the resulting files to a detailed analysis which extracts information. There is a regression.html report that is produced and used by the test group to help determine regression test needs.
Analysis | Comment |
Compare | This analysis shows a DIFF between the original source code and the new instrumented code. |
Baseline Files | This analysis shows whole files deleted from or added to the baseline. |
Details | This analysis shows log events and possible regression tests based on keywords analysis and a DIFF. |
Regression Test Keywords Analysis | This analysis shows a set of potential regression tests based on a keyword analysis. |
Regression Test File Analysis | This analysis shows a set of changed files and potential regression tests based on a keyword analysis. |
This tool is started as a DOS application in the IAT main directory (iat-inst). Example syntax is:
where sv-bfw2 is the new software and the regression.html report is placed in the new software analysis directory (sv-bfw2).
These services require a web server such as Apache. They evolved to support regression testing, user responses to customer questions, and reverse engineering for troubleshooting or reuse. Although the services appear to be similar to services found in other applications, they are significantly more powerful because they allow users to define many parameters. These parameters are key to making these services return quick and meaningful results.
Search is used to search the software or other areas of a project data base to respond to questions. Examples include customer comments that certain functions are not called. A quick search of the function name will return the result to the analyst.
Filter is used to extract information from a selected file. It also has the added benefit of subtracting information from an access query. For example, all the final PUI's shown in a data reduction report for a new test run can be subtracted from the PUI's shown in an old test run to yield the delta between the two test runs. Another example is to extract all the requirements based on a keyword such as a CPC, capability, or selected word.
Compare is used to compare test data from two different test runs. It allows a user to filter and mask certain records and minimize the amount of noise so that a meaningful comparison can be performed. The time stamps are also masked. The compare includes an IAT unique DIFF mechanism and an internal GNU DIFF.html or GnuDiff.pdf.
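A sketch of the masking idea: normalize the records that would otherwise make every line differ, such as time stamps, before handing the two runs to a DIFF. The time stamp and sequence counter patterns below are assumptions:

#!/usr/bin/perl
# mask-sketch.pl -- sketch of record masking before a test data compare.
# usage: perl mask-sketch.pl run1.txt > run1.masked   (then DIFF the masked files)
use strict;
use warnings;

while (my $line = <>) {
    $line =~ s/\b\d{2}:\d{2}:\d{2}(\.\d+)?\b/HH:MM:SS/g;  # assumed time stamp form
    $line =~ s/^\s*\d+\s+//;                              # assumed sequence counter
    print $line;
}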
Libraries are project specific. They are links that point to key data. This is accomplished with the establishment of a project portal, such as this IAT portal.
The IAT allows a project to trace its requirements to the source code level. This traceability is not approximate or based on interpretation, but exact. This traceability is shown at all phases of the project development life cycle. To achieve this traceability a process is summarized in the following tables. This process was established and used on a real time embedded application project with a very aggressive schedule and limited funds. This process assumes that DOORS and the IAT are only accessible by the systems group. This limited DOORS and IAT access is a worst case scenario. The ideal situation includes DOORS and IAT access by all project personnel.
Reuse on your Program
Placing IAT On Your Program
IAT Code Review Process Details
Traceability Frequently Asked Questions
How do you handle legacy log events within a reused software collection? Unlike the subjects on the rest of this IAT portal, the reuse topic has not been tested on a real program.
A number of years ago, when the topic was hot, I was asked to think about reuse. My thoughts were framed by another company that had been developing essentially the same system since my birth. They kept upgrading, and every so often re-architecting, the system and delivering it to different countries for decades. Without realizing it while I was there, they had, IMHO, all of the elements of reuse in place, and by the time I arrived on the scene it was just ingrained in the culture. One of the more important elements was the fact that they realized reuse was not just about software. It was about all the products, including the information products that feed the software. It framed their entire culture.
So IMHO, there are three levels of reuse: (1) coarse grain, (2) medium grain, and (3) fine grain. Coarse grain reuse is where you take the people from a project and move them to a new project of a similar nature; the level of reuse is their experience. Medium grain reuse is taking all the information products from one or more related programs and moving them to a new program. I had extensive experience in that area with the company I referenced above. The information products would include the specifications, studies, models, and design documents. Fine grain reuse is what most folks are familiar with when they think of reuse: it is the reuse of the software.
If instrumented software is moved from one project to another project as part of a reuse effort, the question arises: what do you do with the log events? If the organization is mature enough, then it's a no brainer because the requirements base is the same; there is one SRDB that supports multiple reuse projects. If the SRDB is fragmented across projects, then the IAT in its infinite wisdom will flag that you have a big problem on your program. The problem is the missing information products that are linked to your reused code. If you duplicated them, that's one answer. If you ignored them, that's another. Chances are it's a combination of the two. So is IAT to blame? No, IAT is only the messenger of a process problem. If the process problem exists, what are the alternatives?
There are 3 basic alternatives for software reuse that did not consider all the underlying information products: (1) strip out the log events, (2) translate the log events, or (3) leave the log events in place and enable / disable instrumentation as requested by organizations such as test, systems, or software. The organization then embeds the new requirements as log events into the code.
Alternative 1 - Stripping out the log events is easy. However, once they are gone, they are gone forever. It's easy to throw things away; it's very difficult to create things. The IAT can be set to just ignore these log events. So this approach is not recommended.
Alternative 2 - Translating the requirements is a possibility. It would mean that a requirements analysis needs to include a reuse analysis so that the requirements can be reconciled. This is not a bad approach; in fact, that is what we used to do in that company that practiced reuse for all those decades. Once the translation map is complete, the IAT can drill through the software and make the changes.
Alternative 3 - The IAT as part of instrumentation can automatically identify the reused requirements that are in the new project software baseline. A new "reuse" specification can be easily extracted and imported into DOORS as a module named "reuse" from this IAT product.
Why go to the trouble of capturing the requirements associated with reuse? Because all software needs to be accounted for from a requirements perspective. Again, if your program has an SRDB, it is one of those special mission critical programs. That means there should be no "unaccounted for" software. The next issue is the software design information products, such as object interaction diagrams (OIDs), associated with the reused software. Unfortunately IAT currently cannot help in that area.
Is there a cost? You bet, but it's less than recreating the software. Is it something that must be done? Yes; bidding a reuse program while ignoring the underlying information products shows a lack of understanding of the entire process outside the software boundary. Fortunately, unlike Alternative 2, which is very labor intensive, the IAT can slice through the software and find the reused requirements faster than anyone can read this section. That output can then be fed into DOORS. Call the module "reuse" and declare success. So, this is the recommended approach.
Requirements analysis is used to develop a set of software requirements that can be used for software design. The output of the requirements analysis is captured in a System Requirements Data Base (SRDB) using a tool like DOORS. Traditionally, reports have been produced for people as they work different aspects of a project. With the introduction of IAT, the DOORS output is imported into IAT and used by its code analysis and test services.
The most important IAT aspect of the software design is to ensure that peer reviews are held for the code containing the log events. Doing SV testing is not just about running IAT reports and capturing test data for a release; it is also about following process. A key process element is ensuring that software containing log events is subjected to peer reviews and that a check list is filled out as part of the peer review. The check list is documented evidence that the placement of log events was examined by those closest to the problem, the original designers and coders. This happens long before any IAT activity starts up on a program. The SV lead should examine these check lists to verify that the item addressing log event placement is checked. It's a good idea for the SV lead to attend initial peer reviews to enforce this concept and make sure that very important check list item is addressed and closed at the review.
The schema should include the elements found in any good SRDB. A unique view should be created to support export to IAT. The format of the report should include the PUI, Req Text, CPC, and optionally the phase of delivery.
The picture looks busy, but the bottom line is that testers gather raw test data in TXT files. The TXT files are subjected to data reduction, and HTML files are produced as the test is being performed. There is a TXT and companion HTML file for each test case. At the conclusion of a test, Req Reduction processes all the TXT files and produces a single big picture HTML report of the entire test run. This same service MINES the test case HTML reports and produces a single big picture HTML report of each test case. There are services such as compare, search, and filter which are used to perform ad hoc analysis of the HTML reports. In the end, the job is to hunt for discrepancies and account for all the req's.
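A hypothetical sketch of that Req Reduction pass: sweep every test case TXT file, collect the unique log event numbers, and emit one big picture list for the whole run. The 4-digit event number pattern is an assumption based on the instrumentation counters:

#!/usr/bin/perl
# reqreduction-sketch.pl -- sketch of aggregating unique log events across TXT files.
use strict;
use warnings;

my %seen;
for my $file (glob '*.txt') {               # one TXT file per test case
    open my $fh, '<', $file or die "cannot open $file: $!\n";
    while (<$fh>) {
        $seen{$1}{$file} = 1 while /\b(\d{4})\b/g;   # assumed 4-digit event numbers
    }
    close $fh;
}
for my $event (sort { $a <=> $b } keys %seen) {
    my @cases = sort keys %{ $seen{$event} };
    print "$event  seen in: @cases\n";
}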
There are IAT services and files containing code analysis reports, test data, and test reports. As the project unfolds with multiple internal engineering releases, templates are developed. These templates are extremely valuable. The most valuable templates are the TXT files containing the raw test data. These files contain comments from the testers and essentially become scripts for running the tests. As the project unfolds, the team starts to refer to the TXT files as scripts. An IAT service is available to scrub all scripts of any test data and leave the test comments behind for use in a next round of testing. There are also HTML templates that are left behind from previous test runs. These HTML templates are usually the result of some ad hoc analysis involving a search or a compare.
The directory structure is based on having a holding place for all the tests (iat-test), a holding place for each release (REL x.x), a special place for IAT test data (logev), and a special place for analysis (logevents-analysis). The only name that is fixed is the "logev" directory.
Staff | Activities
Eng Mgmt | Get buy-in from all the stakeholders - Customer, PMO, Software, Systems, Test. |
Eng Mgmt | Identify an IAT project POC who will be responsible for its operation and products. |
IAT expert & project POC | Move a generic form of IAT to the unique program network area (about 1 day). This includes getting ISS to install Apache and AOL press on your IAT workstations. Although not initially required, these applications will be needed on programs once IAT is 100% in place and providing significant services. If you don't have software yet, then you just need to understand what to do with your SRDB and your software code once it starts to get developed. That is a simple meeting with your staff where "we" plan for IAT, and a careful read of the IAT web portal. |
IAT expert & project POC | Run some project software files through code analysis and assess results (2-3 days). This includes getting the project POC up to speed on running very rudimentary IAT services and trying to communicate the IAT big picture to the project POC. |
systems, software, test | Determine if IAT will be used for code analysis, test, or both code analysis and test. |
systems, software, test | Identify what requirements will be instrumented - security critical, mission critical, system, software, all processors, some processors, etc. If this is an existing program, work only with the new requirements associated with the next release. Do not attempt to back fill previous releases for the first IAT based release. |
systems, software, test | Identify if you intend to capture display messages intermixed with the log events. If so, determine if it will be via instrumentation or mixing of display messages with log events on the same interface. |
systems, software, test | Identify how the log events will be captured - JTAG emulator port, RS-232, internally within a PC bus, etc. |
Eng Mgmt | Identify a Software POC who will be responsible for getting the log event interface to a static text file up and running - tester copies and pastes from emulator window or a mechanism that writes to a file or something else. |
IAT POC | Distribute Plan for embedding IAT on the program Outline-for-implementing-IAT.doc. |
Staff | Activities
customer | Customer provides a list of features. |
systems, software | A-type specification is written using MSword. The engineering data is captured via round table discussions while reviewing the OPS specification. |
systems | The OPS specification is imported into DOORS. |
systems with some software participation | Round table discussions are held to determine potential CSCI's. A decision is made to have x SRS's. One for each "computer" and one for when the device has no application software (BOOT state). |
systems | The requirements are extracted and allocated to hardware and software SRS's as a first pass. |
systems | An OPS DOORS table report is produced showing allocations to hardware and software SRS's. |
software | Requirements allocations are reviewed and modified as needed. |
systems | DOORS is updated to reflect allocation changes and a new report is produced and distributed. |
Staff | Activities
software | SRS's are written using MSword. |
systems | SRS's are imported into DOORS. |
systems | OPS and SRS requirements are linked in DOORS as a first pass. |
systems | Two types of DOORS table reports are produced to show allocations and highlight orphans and childless parents. |
software with some systems participation | Reports are reviewed to identify missing requirements and misallocations. Modifications are made. |
systems | DOORS is updated to reflect the changes and new reports are produced and distributed. |
systems | TEO and TOC is written using MSword. |
systems | TOC is imported into DOORS. |
Staff | Activities
systems | OPS, SRSs, and TOC requirements are allocated to system capabilities (threads which become test cases). |
systems | OPS, SRSs, and TOC requirements are allocated to test methods (I, A, D, T). |
software | SRSs and TOC requirements are allocated to CPCs. A CPC is a collection of software source files; it is a directory in the software source and a UNIT referenced in the SDFs. |
systems | DOORS table reports are produced showing the requirements and allocations to CPCs, capabilities, and test methods. |
software | CPC allocations are updated several times. Note that this process never achieves closure; the final allocation is always in the as-built software. However, a best effort should be put forward in this activity to minimize questions in the back end of the program when everything is reconciled. |
systems | DOORS table reports regenerated with each CPC allocation change request. Requirements reports are generated for use by the IAT. |
Staff | Activities
software | TOC requirements are physically entered into the CPC software source code as a comment statement where the requirement is satisfied. These are the log events. This is the actual as built final allocation results. |
software | Peer review is announced. Announcement includes where the software is located. |
systems | Software is accessed and the IAT is run against software to be reviewed. E-mail generated summarizing the findings and pointing to the detailed IAT reports. |
software, IAT reports, and initially a physical presence by systems | Peer review held to verify several items including log event placement. |
Test Director with input from TD and software leads | Code walkthrough with the customer is announced. |
systems | IAT is run against software to be reviewed. E-mail generated summarizing the findings and pointing to the detailed IAT reports. |
software | Walk through held to verify several items including log event placement. |
systems | CD of reports sent to Government |
Staff | Activities
software, systems | Create a C program that supports emulator log event dumping. |
software | Make changes to ensure that instrumentation works. This may include moving software in the memory map. |
systems for dry runs and CM for formal testing | Instrument the software using the IAT. |
systems with software review if they have the time | Review IAT reports to determine if requirements covered in the current build are as expected. Use the IAT to support code familiarization. |
systems | Run tests IAW the test cases, which are the system threads and capabilities. Capture the log events from the emulator. Run data reduction reports for each test case. |
systems | Examine the final master report for requirements and log events coverage. At the conclusion of the tests execute the reqreduction.pl tool to extract all the unique log events from the entire test activity. Run the labreduction.pl tool with all the options enabled. |
systems | Examine the test results associated with each test thread and the master report. |
systems | If this is the formal test and not a dry run, incorporate findings into the test results report. |
Good, bad, or indifferent, this is what happened with DOORS on Project-O.
Schema for the Project-O DOORS database
Formal Modules
Link Modules
Other Modules that you see
The reasons for the 3 CSCI's are:
Attributes
Some of the attributes started as pull down selection boxes. All the attributes are now text fields. We did not know how to perform global inserts for pull down boxes, and doing those items 1 at a time took too long. Since DOORS training, I have a feeling that may be possible with the pull down selection boxes, but have not tried it... The attributes are:
Preparing the Import and Importing
I took a quick look at the current PPP SRDB (System Req's Data Base) and noticed that there are multiple req's in the same object (row). On Project-O we imported many times until we figured out what needed to be done.
Step A was to go into the word document and do a global search and replace so that each period was followed by a paragraph mark. The search pattern was <. > (note the space) and the replace pattern was <. ^p>. DOORS will place windows lists in a separate object, so remove any periods from the lists. There was a time when we wanted the lists to go into a single object, but we could never figure out how to do it without significant manual merging once in DOORS (our new version of DOORS).
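For anyone preparing the document outside MSword, the same period-space rule (with the same limitations around abbreviations and lists) can be applied with a one-line Perl filter; the file names are examples only:

perl -pe 's/\. /.\n/g' ops-spec.txt > ops-split.txt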
We then experimented with different import approaches (rtf, txt, deleting figures and tables, etc.). The bottom line is: open the document in MSword and press the icon to export to DOORS. This will grab everything, including the pictures and tables. It is the cleanest import to DOORS, capturing the paragraph headings and non heading stuff. DOORS assigns a unique number to each object (row) which is different from the "paragraph" number that DOORS assigns with other import methods. Project-O used the object number rather than the DOORS "paragraph" number. The paragraph number is not absolute while the object number is, so there can never be confusion in the future. Also, the DOORS paragraph number is ugly for a human; too much noise. The bottom line is to use the OBJECT number for the Req PUI (Project Unique Identifier), IMHO.
Generating Reports and Exporting Back to Word
Use the wizard. Do not try to create custom reports by modifying DXL. We tried that on Project-O; you might as well stick toothpicks in your eyes. Use the wizard and learn to live with the views the wizard provides. We tried to create a few standard views, but the wizard is so fast that the standard views were used primarily as a guide to remember what the last "good" report looked like...
Do not include the tables in the report if you plan to export back to Word; it takes way too long. Although you can link to each cell in a table, we never did that, again because of the huge amount of time in creating those links. By time I do not mean DOORS button pushing, but gray matter time: someone needs to think about the allocation. After we reflected on what benefit might arise if we linked to each table cell, the bottom line answer was NO. We also considered linking to the table references at the start of each table. It never happened... again, to what benefit?
So the Project-O SRDB is English phrases; no pictures or tables, even though the original stuff is in the database. We did try to make each phrase stand alone. We created and published a requirements style guide which included instructions and samples on writing good clean req's text. The SRS's and OPS are not bad, considering everything else... I would recommend finding this style guide and using something similar on PPP. Its roots are Al Starr and company.
Attribute Versus Linkable Object
The question always arises: should something be an attribute or a linkable object? The best way to answer that is by example and some history. In the old days, before COTS SRDBs, organizations would try to show traceability using a spreadsheet or a word processing table. Those who have worked with those information products know that the view is only in one direction, the direction the table was built. For example, if req's were allocated to test cases, the product would list the test cases in the left column and the requirements would be in the right column. But, for each row, while there would be one entry for the test case, there would be multiple entries for the requirements. The dilemma arises if the user wants the information product sorted by requirements. What would happen is the rows would be duplicated using a copy and paste operation and the cell with the multiple entries would be manually parsed. Now here is the bugaboo: the example is test case versus requirement, but the information product would always start out as requirement versus test case. So the manual parsing was always required if you wanted all the requirements associated with a test case.
The advantage of an SRDB is that, if you use the linkable object method, the SRDB will create bidirectional views without any manual changes. So, if you are silly enough to make your test cases an attribute, then you will be forced to use the old manual approach of parsing your information for a view that is based on test cases. Essentially you have demoted the wonderful power of an SRDB to the state of an office automation product like a spreadsheet or word-processing information product. Further, you have forever lost any ability to show the relationship to any other information that is at a lower level. For example, you cannot create a view that shows all your test cases linked to lower level requirements like your SRSs. You also cannot show links to other information like your CPCs. Finally, you cannot see the chain that stretches from CPC, to SRS, to A-level Spec, to system test.
So, anything that needs to show bidirectional views should be based on a linkable object. Also, if the object is related to other items in the database, the linkable chain path should be preserved. In general, you should have a very good argument for reducing something to an attribute, because as an attribute it has no ability to be viewed in alternative ways; it is stuck in a word-processing or spreadsheet-style table view.
The IAT supports both peer reviews and code walk throughs. The difference between a peer review and a code walk through is the amount of software under review. During a peer review the collection of software being reviewed is a small subset of the application. During a code walk through the collection of software being reviewed is a working, standalone application. On Project-O the code walk through has been merged with, and is coincident with, a software release.
IAT Based Peer Review Process
IAT Based Code Walk Through
Analysis Process
The IAT Test Process contains the following major blocks: Requirements, test case development, test run preparation, test execution, test wrap up, test analysis, and test complete activities.
Test Requirements: The requirements need to be imported from DOORS into IAT. This is a manual process using MSword to convert the file format into a flat ASCII text file and a .html file. The flat ASCII text file is accessed by the IAT internals and the .html file is used by the IAT links.
Test Case Development: Test case development begins with a traditional MSword document, but it is augmented with test scripts that are saved and maintained as separate stand-alone text files.
Test Run Preparation: Test preparation includes setting up the test directory and populating the directory with IAT services software and test scripts.
Test Execution: Test execution depends upon the level of sophistication of the test environment. This environment can be simple, where testers copy and paste log events into a text file, or complex, where the machinery writes directly to a disk file and intermixes log events with target machine display information.
Test Wrap Up: At the conclusion of a test battery, there is an accounting that reconciles all the requirements found in the DOORS database. This accounting is accomplished by executing unique IAT services.
Test Analysis: Test analysis is a grueling activity that involves looking at the log events from multiple perspectives. The IAT includes a number of services to ease that process, including filtering, searching, comparing, and keyword coloring.
Test Complete: At the end of the test analysis everything needs to be organized and put away for future reference and possible reuse. This also includes transferring metrics data to external reports and updating test catalog web pages.
Test Requirements - Create an MSword version of the DOORS report.
Test Requirements - Convert the DOORS report to formats that IAT understands - flat ASCII text file
Test Requirements - Convert the DOORS report to formats that IAT understands - .html file
Test Requirements - Transfer the converted DOORS reports to IAT areas
Test Case Development - Writing test procedures
Test Run Preparation - All
Test Run Preparation - First Time
Test Run Preparation - There is a previous release to start from in this release
Test Execution - Manual transfer of log events to test script files
Test Execution - Semi Automatic writing of log events to test script files
Test Execution - Automatic writing of log events to test script file
Test Execution - Complete
Test Wrap Up - Requirements Accounting
Test Complete - This includes cataloging the test data and transferring metrics data
1. What is a CPC and a UNIT?
In my view a unit was and still is the lowest common denominator in software. This translates to a subroutine / function call. Occasionally I will back off and consider a unit a source file which may include from 1 to 6 functions / subroutines. This view was formed over 20 years ago while working with an outfit that was writing the next generation software processes which eventually became MIL-STD-2167. It was in the day when Yourdon ruled the nest. A CPC was a collection of units. You could have lower level CPCs within upper level CPCs. A CSCI was a collection of CPCs.
As I have made my trek across the country working for many different types of organizations, the definition of a UNIT was always mired in obfuscation and confusion, while the software team would try to ignore what was needed for traceability and focus on providing software code. In the end it never mattered, because traceability to the UNIT never really occurred. There was no reasonable mechanism to capture it and show it, with or without an SRDB (DOORS). The links would eventually break down shortly after the CSCI level. In fact, it was only in the more enlightened organizations that traceability would be reasonable at the CSCI level. The test organization would be left to pick up the pieces and attempt to imply traceability while creating verification cross reference matrices.
So traceability could range from coarse grain to fine grain. Fine-grain traceability would take it to the unit level; coarse-grain traceability would take it to the CSCI level. The cost associated with providing fine-grain traceability would far outweigh the needs of providing basic functionality and performance, so it would never happen.
In Project-O a UNIT was a software folder on the NT network. It was also a representation of the SDF. It was a collection of multiple C, ASM, and H files. Examples include SES, TRF, LDI, etc. However, even though the software team defined a unit more coarsely than the lowest common denominator, the traceability provided was fine grain and went all the way down to the function call in 99% of the cases. This was accomplished with the IAT and the defined process. The remaining 1% would occur at the source file level, when the IAT was not successful in locating the associated function.
Here is the bugaboo. Hardware folks have no problem understanding the concept of a unit. For hardware, a unit is a thing with a part number, and everything has a part number, even the lowly resistor. Software people do not understand that simple concept, so when you say every function in every file should have a part number, they get very upset. Why are units and the nomenclature of units so important for some folks? It is all for traceability. Traceability can only exist if you can identify a thing. The lower the level of the thing you identify, the closer you get to the area of interest once an SRDB trace analysis is performed.
2. Where do I show traceability, in DOORS or IAT?
In Project-O traceability was down to the CPC level in DOORS and down to the function call level (a UNIT as it should be defined) in IAT. IAT also shows the CSCI and each CPC, where the CPC is a WinNT directory, is an SDF, and is what the software team called a UNIT. The traceability is approximate in DOORS, since the CPC requirements allocation was never updated after the software was built and released. The traceability is final and exact in IAT: it is an as-built view with almost no room for error, since the actual designer performed the final allocation.
3. Should I go back and update DOORS with the fine grain traceability outputs of IAT?
You have to ask yourself the all-important question: what is the purpose of showing the traceability to the UNIT level?
If the purpose is to provide some assurance that the software does what it is supposed to do, and the program includes many third party reviewers, then showing traceability in the IAT is probably the most effective presentation. It provides links to the source code and the requirements. It provides listings that are easy to read and follow, with the requirements summarized as embedded comments very close to where they are satisfied.
If the purpose is to manage large numbers of subcontractors and build a repository that may last a few decades, then the SRDB should be maintained. This means that the SRDB should be updated frequently including entering updates from the outputs of the IAT.
4. Ok, so how does the IAT show the traceability to the UNIT level?
A picture is worth a thousand words. The output of the IAT is reports. The power of the IAT is not in its absolute "here it is" nature, but in its Internet roots. This means that the output is easily formatted to fit any need as it arises.
Currently the traceability to the UNIT level is captured in any of the following reports which are produced during instrumentation: s_codegen.html, s_le-summary.html, or s_simulate.html. The s_le-summary.html report is probably the most effective analysis report. It only shows the software files and functions which contain req's and has links to the source code and original req text.
If you want to update the SRDB, the IAT has a report, s_srdbupdate.html, that was specifically created for this activity. Its format is such that a data entry person can update the SRDB. It also provides table views of req's vs c-function and c-file, sorted by either req number or path/file name.
5. My head is about to explode, so what should I do?
Keep it simple and follow the Project-O approach to traceability. Capture the front end as best as you can in DOORS realizing that eventually you will cut the umbilical and transfer to the IAT.
If you must update DOORS, then perform the update after the first release of the software. The output from the IAT will be real and complete for that release. If you do not like the current report formats, consider creating a new IAT report to streamline the DOORS update process.
Consider your stakeholders. What tools and reports are they really examining? If no one has access to your SRDB, but everyone has access to your source code and IAT reports, then guess what everyone will use for their respective jobs?
Keep in mind that almost everyone has unit development folders, and they have req's in those folders; however, that does not ensure "modern up down traceability" goals as can be provided with an SRDB. Populating the SRDB is one problem, but the other problem is actually ensuring that you have traceability without orphans and childless parents. That analysis can only occur with an SRDB. That is why DOORS exists. So after you populate your SRDB, don't forget to produce those reports and have the staff fill the holes. Again, no one goes much below the CSCI level. The really good organizations make it to the CPC level; the really great organizations make it to the unit level in the SRDB. I have never seen a really great organization, so for me it's a myth and a challenge which IAT attempted to address.
6. I am not using IAT, so how do I show traceability to the UNIT level?
Good Luck. Let folks know what you did :)
7. Why are we just talking about software traceability?
Traceability, and the lack of it, always boils down to the same thing: software.
Thanks to the explosion of hardware performance, systems have all become software intensive. Software has the unique problem of dealing with millions of things (LOC), which means that no one can really show traceability in the end except for the individual designers and implementors. Yet when the topic of traceability surfaces, it's irrelevant to the typical software organization tasked with getting basic functionality out the door.
Therein lies the dilemma: too much to do and not enough time or resources to do it. Further, traceability to the software UNIT level is excruciating by its very nature, being so fine a level of detail. So no one wants to do it, even if the time or resources could materialize.
IAT is based on these realizations and is a mechanism that extracts the traceability from the software team in the least intrusive way possible, primarily by piggybacking on their existing activities.
Tailoring and Metrics Frequently Asked Questions
1. The IAT uses pattern searches to perform all of its services, so think in terms of patterns. These pattern searches look for things to note or ignore. The IAT is not a complex device: each line of code is examined only once, by subjecting it to these pattern searches. These pattern searches form the rule sets of the IAT (see the sketch after this list). Some complex rules have look-ahead pattern search capabilities. This means that the choice of key words and naming conventions is important.
2. Do not use TBD for all place holders. Differentiate the TBDs. For example, TBF = future, TBS = to be specified, TBC = to be coded, TBD = to be determined, etc. Or just use TBD-1, TBD-2, TBD-n.
3. For all new code, use a naming convention for security critical variables. This will reduce the noise in the error and Hamming distance reports. Get a list of the security critical variables that do not follow the naming conventions and incorporate them into the IAT analysis.
4. Do a peer review ASAP so that the process can be started on a simple case.
5. Do not bypass peer reviews as the next phase of a project starts, for example after an initial release to the field. Take the opportunity to get IAT peer review reports on the new features being added to the device and, as a minimum, review the requirements captured in the software under review.
6. Use the regression service to perform DIFFs on software that goes to the factory. Do not let software go to the factory without using this service. Make sure it is reviewed and not just mechanically generated. Examine the recommended regression tests carefully.
7. Incorporate the trivial cosmetic changes ASAP. This will reduce the noise in the reports and allow everyone to focus on the real problems while feeling good about having some initial success. There is a tendency to delay these cosmetic changes. If that is the case on your program, then disable these features in the IAT. However, be careful: bad headers are bad headers, and if they show no history, then it's valid to assume you have lost CM control on those modules.
8. During test, trust your log events. If a test fails to complete as expected, do not try to understand the problem by resorting to the emulator or multiple changes to the test scenario and configuration. Trust the IAT and its log events. Experience has shown that IAT identifies where the failure occurs without question. That information has always pointed to a software or configuration problem that is quickly understood (within seconds of seeing the log events).
9. Do not add missed log events or make code analysis changes as a special activity. Experience has shown that between releases much of the software is revisited as part of enhancements and problem correction. Add the log events as part of the process of doing something else to a module. So there should be a "to do list" each time a module is opened for change. That "to do list" should include checking the missing log events list. This also applies to the software code analysis changes. The "to do list" should include fixing headers and other outputs from the code analysis reports. BUT do not delay or stonewall the work that needs to be performed. Experience has shown there is usually no excuse for not putting the changes in the next release.
10. In the past the test team set hundreds of breakpoints in the software, giving them control. With IAT, the software team sets the "logical breakpoints of the past" as log events. Do not prevent the testers from adding, changing, and moving log events during test preparation. Preventing this simple task takes away the control the test team had when they set hundreds of breakpoints in the software. Adding and moving 10-20 log events per release is trivial and will eliminate significant tension between management, the software staff, and the test team needing this critical service.
11. Consider logging content on your program ASAP. IAT has all the hooks to log content; the only issue is your unique approach to getting the log events and log content out of your interface. Content is a key element that further reinforces placement of the log events: if the content is not right, you know there is a log event placement problem. In the past testers would search for the elusive breakpoint by looking at the related content; this would verify the proper placement of the breakpoint.
12. Log events in the code are not just about automated testing. Placement of log events is also used to verify requirements in the software: the code analysis reports include visualizations that quickly link software and requirements. Placement of log events can also be used to perform traditional breakpoint testing, where the log event identifies ahead of time the logical location of a breakpoint. In fact, the very first IAT based testing was based on both breakpoints (using log events as the breakpoint locators) and the automated output of a log event to a file.
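As promised in item 1, here is a minimal sketch of the rule-set idea. The patterns, categories, messages, and file name below are hypothetical illustrations, not the actual IAT rules:

# Hypothetical rule set: pattern, category, message, in the spirit of
# the IAT attentions / warnings / cautions / notes.
my @rules = (
    [ qr/\bTBD\b/,       'WARNING',   'undifferentiated TBD place holder' ],
    [ qr{/\s*0},         'ATTENTION', 'possible division by zero'         ],
    [ qr/\blogevent\b/i, 'NOTE',      'log event present'                 ],
);

open(SRC, "<example.c") or die "open: $!";
while (my $line = <SRC>) {
    # Each line of code is examined only once,
    # by subjecting it to every pattern search.
    for my $rule (@rules) {
        my ($pat, $cat, $msg) = @$rule;
        print "$cat line $.: $msg\n" if $line =~ $pat;
    }
}
close(SRC);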
The framework is implemented in PERL. One of the advantages of PERL is that variable length and type are not relevant, so a single variable can slurp in an entire document, and the manipulations on that document can be performed by manipulating that single variable.
The heart of this tool is the regular expression. Regular expressions are used to instrument the code, search for keywords, find bad patterns, correlate the lab test data with the original source code comment statements, and correlate the comment statements with the original TOC specification. The tool accesses all the software in a directory tree, one file at a time. The file is moved into an array and into a single variable. Some of the checks are performed against the entire file in the single variable (e.g. classification checks). The remaining checks are performed on each individual line in the array (e.g. show ham values). In some cases multiple items in the array are checked (e.g. consecutive logevents). Everything else is mundane and mechanical.
The reports are HTML formatted. Normally in PERL, if executed from a web server, the output is sent to STDOUT, which is the web server. STDIN originates from the client's browser via the web server. STDERR goes to an error file defined by a configuration directive. In the absence of a web server the STDOUT is redirected to a file with the following sequence:
# set to save HTML page status
open(DATA, ">$analysis/$report");
select(DATA);
From this point on, all output is sent to the disk file named by $analysis/$report. When the instructions:

select(STDOUT);
close(DATA);

are executed, output goes to the normal STDOUT again, which is the DOS window if the program was started from DOS, or the web server if the program was started from a browser. (Note that select(STDOUT) is what restores the default output; closing the handle alone does not.)
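Putting the pieces together, a minimal stand-alone sketch of the redirection idiom (the directory and variable values here are illustrative):

my $analysis = "analysis";
my $report   = "s_codegen.html";

open(DATA, ">$analysis/$report") or die "open: $!";
select(DATA);     # default print output now goes to the report file
print "<html><body>report body</body></html>\n";
select(STDOUT);   # restore the normal default output
close(DATA);
print "report written to $analysis/$report\n";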
There is a "main" at the top of the 3 programs. The main calls the subprograms that form this framework. To enable the script operations from the DOS window the "main" area of each of the 3 programs was encapsulated as a sub program:
sub logevents_main{
all the main junk
}
removing this encapsulation or just adding a call at the top of each program will allow these 3 programs to operate in stand alone mode, without the calls from the scripts / templates.
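For example, a one-line call at the top of the program (using the subprogram name shown above) gives stand-alone operation from the DOS window:

# Run the encapsulated main directly, without the
# calls from the scripts / templates.
logevents_main();

sub logevents_main {
    # ... all the main processing ...
}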
The data stored on disk is read in using two approaches. The first approach reads the file into an array. The second approach reads the file into a variable. Occasionally, data read in as an array is also passed to a variable.
The reason for these two approaches to internally storing the file data is associated with the regular expression processing. If the program is looking for a certain pattern across an entire file, the variable is used in the regular expression. If, on the other hand, a single line entry needs to be examined, then the array is sequenced and a regular expression processes each array element. Sometimes it is possible to perform regular expression processing on the non-array form, but regular expressions are not trivial, and for simplicity and the desire to get consistent results, the array method is used.
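A minimal sketch of the two read approaches (the file name and patterns are illustrative):

open(FILE, "<example.c") or die "open: $!";

# Approach 1: read the file into an array, one line per element,
# for the line-at-a-time checks.
my @lines = <FILE>;

# Approach 2: rewind and slurp the same file into a single variable,
# for the whole-file pattern checks.
seek(FILE, 0, 0);
my $whole;
{
    local $/;        # undefine the record separator: slurp mode
    $whole = <FILE>;
}
close(FILE);

# Whole-file check against the single variable
# (e.g. a classification check):
print "classification marking found\n" if $whole =~ /UNCLASSIFIED/;

# Line-by-line check against the array (e.g. a log event check):
for my $i (0 .. $#lines) {
    print "logevent at line ", $i + 1, "\n" if $lines[$i] =~ /logevent/i;
}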
In an RE there are plenty of special characters, and it is these that both give them their power and make them appear very complicated. It's best to build up your use of REs slowly; their creation can be something of an art form.
Here are some special RE characters and their meaning
.    # Any single character except a newline
^    # The beginning of the line or string
$    # The end of the line or string
*    # Zero or more of the last character
+    # One or more of the last character
?    # Zero or one of the last character
and here are some example matches. Remember that a pattern should be enclosed in /.../ slashes to be used.
t.e     # t followed by anything followed by e
        # This will match the
        #                 tre
        #                 tle
        # but not te
        #         tale
^f      # f at the beginning of a line
^ftp    # ftp at the beginning of a line
e$      # e at the end of a line
tle$    # tle at the end of a line
und*    # un followed by zero or more d characters
        # This will match un
        #                 und
        #                 undd
        #                 unddd (etc)
.*      # Any string without a newline. This is because
        # the . matches anything except a newline and
        # the * means zero or more of these.
^$      # A line with nothing in it.
There are even more options. Square brackets are used to match any one of the characters inside them. Inside square brackets a - indicates "between" and a ^ at the beginning means "not":
[qjk]      # Either q or j or k
[^qjk]     # Neither q nor j nor k
[a-z]      # Anything from a to z inclusive
[^a-z]     # No lower case letters
[a-zA-Z]   # Any letter
[a-z]+     # Any non-zero sequence of lower case letters
At this point you can probably skip ahead; the rest of this material is mostly for reference.
A vertical bar | represents an "or" and parentheses (...) can be used to group things together:
jelly|cream   # Either jelly or cream
(eg|le)gs     # Either eggs or legs
(da)+         # Either da or dada or dadada or...
Here are some more special characters:
\n   # A newline
\t   # A tab
\w   # Any alphanumeric (word) character.
     # The same as [a-zA-Z0-9_]
\W   # Any non-word character.
     # The same as [^a-zA-Z0-9_]
\d   # Any digit. The same as [0-9]
\D   # Any non-digit. The same as [^0-9]
\s   # Any whitespace character: space,
     # tab, newline, etc
\S   # Any non-whitespace character
\b   # A word boundary, outside [] only
\B   # No word boundary
Clearly characters like $, |, [, ), \, / and so on are special cases in regular expressions. If you want to match one of those you have to precede it with a backslash. So:
\|   # Vertical bar
\[   # An open square bracket
\)   # A closing parenthesis
\*   # An asterisk
\^   # A caret symbol
\/   # A slash
\\   # A backslash
and so on.
As was mentioned earlier, it's probably best to build up your use of regular expressions slowly. Here are a few examples. Remember that to use them for matching they should be put in /.../ slashes.
[01] # Either "0" or "1" \/0 # A division by zero: "/0" \/ 0 # A division by zero with a space: "/ 0" \/\s0 # A division by zero with a whitespace: # "/ 0" where the space may be a tab etc. \/ *0 # A division by zero with possibly some # spaces: "/0" or "/ 0" or "/ 0" etc. \/\s*0 # A division by zero with possibly some # whitespace. \/\s*0\.0* # As the previous one, but with decimal # point and maybe some 0s after it. Accepts # "/0." and "/0.0" and "/0.00" etc and # "/ 0." and "/ 0.0" and "/ 0.00" etc.
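A quick way to experiment with these is a small test loop (a sketch; the sample strings are arbitrary):

# Try a division-by-zero pattern against a few sample strings.
my @samples = ('x = y /0;', 'x = y / 0.00;', 'x = y / 10;');
for my $s (@samples) {
    if ($s =~ /\/\s*0(\.0*)?/) {
        print "possible division by zero: $s\n";
    }
}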
The pattern-match operators accept several modifiers:

/i  Do case-insensitive pattern matching. If use locale is in effect, the case map is taken from the current locale.

/m  Treat string as multiple lines. That is, change ``^'' and ``$'' from matching only at the very start or end of the string to matching at the start or end of any line anywhere within the string.

/s  Treat string as single line. That is, change ``.'' to match any character whatsoever, even a newline, which it normally would not match.
The /s and /m modifiers both override the $* setting. That is, no matter what $* contains, /s without /m will force ``^'' to match only at the beginning of the string and ``$'' to match only at the end (or just before a newline at the end) of the string. Together, as /ms, they let the ``.'' match any character whatsoever, while still allowing ``^'' and ``$'' to match, respectively, just after and just before newlines within the string.
/x  Extend your pattern's legibility by permitting whitespace and comments. These are usually written as ``the /x modifier'', even though the delimiter in question might not actually be a slash. In fact, any of these modifiers may also be embedded within the regular expression itself using the new (?...) construct. See below.
The /x modifier itself needs a little more explanation. It tells the regular expression parser to ignore whitespace that is neither backslashed nor within a character class. You can use this to break up your regular expression into (slightly) more readable parts. The # character is also treated as a metacharacter introducing a comment, just as in ordinary Perl code. This also means that if you want real whitespace or # characters in the pattern (outside a character class, where they are unaffected by /x), you'll have to either escape them or encode them using octal or hex escapes. Taken together, these features go a long way towards making Perl's regular expressions more readable. Note that you have to be careful not to include the pattern delimiter in the comment--perl has no way of knowing you did not intend to close the pattern early.
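As a small illustration of /x (the pattern is an arbitrary example), the same date match written densely and then with whitespace and comments:

my $date = "2002-09-17";

print "dense ok\n"  if $date =~ /^(\d{4})-(\d\d)-(\d\d)$/;

print "spread ok\n" if $date =~ /
    ^ (\d{4})    # year
    - (\d\d)     # month
    - (\d\d)     # day
    $
/x;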
The patterns used in pattern matching are regular expressions such as those supplied in the Version 8 regex routines. (In fact, the routines are derived (distantly) from Henry Spencer's freely redistributable reimplementation of the V8 routines.)
In particular the following metacharacters have their standard egrep-ish meanings:
\    Quote the next metacharacter
^    Match the beginning of the line
.    Match any character (except newline)
$    Match the end of the line (or before newline at the end)
|    Alternation
()   Grouping
[]   Character class
By default, the ``^'' character is guaranteed to match only at the beginning of the string, the ``$'' character only at the end (or before the newline at the end), and Perl does certain optimizations with the assumption that the string contains only one line. Embedded newlines will not be matched by ``^'' or ``$''. You may, however, wish to treat a string as a multi-line buffer, such that the ``^'' will match after any newline within the string, and ``$'' will match before any newline. At the cost of a little more overhead, you can do this by using the /m modifier on the pattern match operator. (Older programs did this by setting $*, but this practice is now deprecated.)
To facilitate multi-line substitutions, the ``.'' character never matches a newline unless you use the /s modifier, which in effect tells Perl to pretend the string is a single line--even if it isn't. The /s modifier also overrides the setting of $*, in case you have some (badly behaved) older code that sets it in another module.
The following standard quantifiers are recognized:
*       Match 0 or more times
+       Match 1 or more times
?       Match 1 or 0 times
{n}     Match exactly n times
{n,}    Match at least n times
{n,m}   Match at least n but not more than m times
(If a curly bracket occurs in any other context, it is treated as a regular character.) The ``*'' modifier is equivalent to {0,}, the ``+'' modifier to {1,}, and the ``?'' modifier to {0,1}. n and m are limited to integral values less than 65536.
By default, a quantified subpattern is ``greedy'', that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match. If you want it to match the minimum number of times possible, follow the quantifier with a ``?''. Note that the meanings don't change, just the ``greediness'':
*?       Match 0 or more times
+?       Match 1 or more times
??       Match 0 or 1 time
{n}?     Match exactly n times
{n,}?    Match at least n times
{n,m}?   Match at least n but not more than m times
Because patterns are processed as double quoted strings, the following also work:
\t     tab                  (HT, TAB)
\n     newline              (LF, NL)
\r     return               (CR)
\f     form feed            (FF)
\a     alarm (bell)         (BEL)
\e     escape (think troff) (ESC)
\033   octal char (think of a PDP-11)
\x1B   hex char
\c[    control char
\l     lowercase next char (think vi)
\u     uppercase next char (think vi)
\L     lowercase till \E (think vi)
\U     uppercase till \E (think vi)
\E     end case modification (think vi)
\Q     quote (disable) pattern metacharacters till \E
If use locale is in effect, the case map used by \l, \L, \u and \U is taken from the current locale.
You cannot include a literal $ or @ within a \Q sequence. An unescaped $ or @ interpolates the corresponding variable, while escaping will cause the literal string \$ to be matched. You'll need to write something like m/\Quser\E\@\Qhost/.
In addition, Perl defines the following:
\w Match a "word" character (alphanumeric plus "_") \W Match a non-word character \s Match a whitespace character \S Match a non-whitespace character \d Match a digit character \D Match a non-digit character
A \w matches a single alphanumeric character, not a whole word. To match a word you'd need to say \w+. If use locale is in effect, the list of alphabetic characters generated by \w is taken from the current locale. You may use \w, \W, \s, \S, \d, and \D within character classes (though not as either end of a range).
Perl defines the following zero-width assertions:
\b   Match a word boundary
\B   Match a non-(word boundary)
\A   Match only at beginning of string
\Z   Match only at end of string, or before newline at the end
\z   Match only at end of string
\G   Match only where previous m//g left off (works only with /g)
A word boundary (\b) is defined as a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W. (Within character classes \b represents backspace rather than a word boundary.) The \A and \Z are just like ``^'' and ``$'', except that they won't match multiple times when the /m modifier is used, while ``^'' and ``$'' will match at every internal line boundary. To match the actual end of the string, not ignoring newline, you can use \z. The \G assertion can be used to chain global matches (using m//g), as described in Regexp Quote-Like Operators. It is also useful when writing lex-like scanners, when you have several patterns that you want to match against consequent substrings of your string; see the previous reference. The actual location where \G will match can also be influenced by using pos() as an lvalue.
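For instance, a small sketch of chained matches with \G (the input string is arbitrary):

# Scan a comma-separated list; each match picks up
# exactly where the previous m//g left off.
my $list = "12,345,6789";
while ($list =~ /\G(\d+),?/g) {
    print "field: $1\n";
}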
When the bracketing construct ( ... ) is used, \<digit> matches the digit'th substring. Outside of the pattern, always use ``$'' instead of ``\'' in front of the digit. (While the \<digit> notation can on rare occasion work outside the current pattern, this should not be relied upon. See the WARNING below.) The scope of $<digit> (and $`, $&, and $') extends to the end of the enclosing BLOCK or eval string, or to the next successful pattern match, whichever comes first. If you want to use parentheses to delimit a subpattern (e.g., a set of alternatives) without saving it as a subpattern, follow the ( with a ?:.
You may have as many parentheses as you wish. If you have more than 9 substrings, the variables $10, $11, ... refer to the corresponding substring. Within the pattern, \10, \11, etc. refer back to substrings if there have been at least that many left parentheses before the backreference. Otherwise (for backward compatibility) \10 is the same as \010, a backspace, and \11 the same as \011, a tab. And so on. (\1 through \9 are always backreferences.)
$+ returns whatever the last bracket match matched. $& returns the entire matched string. ($0 used to return the same thing, but not any more.) $` returns everything before the matched string. $' returns everything after the matched string. Examples:
s/^([^ ]*) *([^ ]*)/$2 $1/; # swap first two words
if (/Time: (..):(..):(..)/) {
    $hours = $1;
    $minutes = $2;
    $seconds = $3;
}
Once perl sees that you need one of $&, $`, or $' anywhere in the program, it has to provide them on each and every pattern match. This can slow your program down. The same mechanism that handles these provides for the use of $1, $2, etc., so you pay the same price for each pattern that contains capturing parentheses. But if you never use $&, etc., in your script, then patterns without capturing parentheses won't be penalized. So avoid $&, $', and $` if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price. As of 5.005, $& is not so costly as the other two.
Backslashed metacharacters in Perl are alphanumeric, such as \b, \w, \n. Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric. So anything that looks like \\, \(, \), \<, \>, \{, or \} is always interpreted as a literal character, not a metacharacter. This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-alphanumeric characters:
$pattern =~ s/(\W)/\\$1/g;
Now it is much more common to see either the quotemeta() function or the \Q escape sequence used to disable all metacharacters' special meanings like this:
/$unquoted\Q$quoted\E$unquoted/
Perl defines a consistent extension syntax for regular expressions. The syntax is a pair of parentheses with a question mark as the first thing within the parentheses (this was a syntax error in older versions of Perl). The character after the question mark gives the function of the extension. Several extensions are already supported:
(?#text)  A comment; the text is ignored. If the /x switch is used to enable whitespace formatting, a simple # will suffice. Note that perl closes the comment as soon as it sees a ), so there is no way to put a literal ) in the comment.
(?:pattern)  This is for clustering, not capturing; it groups subexpressions like ``()'', but doesn't make backreferences as ``()'' does. So

@fields = split(/\b(?:a|b|c)\b/)

is like

@fields = split(/\b(a|b|c)\b/)

but doesn't spit out extra fields. Any letters between ? and : act as flags modifiers, as with (?imsx-imsx). In particular,

/(?s-i:more.*than).*million/i

is equivalent to the more verbose

/(?:(?s-i)more.*than).*million/i
(?=pattern)  A zero-width positive lookahead assertion. For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab in $&.
(?!pattern)  A zero-width negative lookahead assertion. For example, /foo(?!bar)/ matches any occurrence of ``foo'' that isn't followed by ``bar''. Note however that lookahead and lookbehind are NOT the same thing; you cannot use this for lookbehind. /(?!foo)bar/ will not do what you want. That's because the (?!foo) is just saying that the next thing cannot be ``foo''--and it's not, it's a ``bar'', so ``foobar'' will match. You would have to do something like /(?!foo)...bar/ for that. We say ``like'' because there's the case of your ``bar'' not having three characters before it. You could cover that this way: /(?:(?!foo)...|^.{0,2})bar/. Sometimes it's still easier just to say:

if (/bar/ && $` !~ /foo$/)
(?<=pattern)  A zero-width positive lookbehind assertion. For example, /(?<=\t)\w+/ matches a word following a tab, without including the tab in $&. Works only for fixed-width lookbehind.
(?<!pattern)  A zero-width negative lookbehind assertion. For example, /(?<!bar)foo/ matches any occurrence of ``foo'' that does not follow ``bar''. Works only for fixed-width lookbehind.
(?{ code })  An experimental zero-width assertion which evaluates any embedded Perl code. It always succeeds, and its code is not interpolated. Currently the rules to determine where the code ends are somewhat convoluted. The code is properly scoped in the following sense: if the assertion is backtracked (compare Backtracking), all the changes introduced after localisation are undone, so

$_ = 'a' x 8;
m<
   (?{ $cnt = 0 })                  # Initialize $cnt.
   (
     a
     (?{
         local $cnt = $cnt + 1;     # Update $cnt, backtracking-safe.
     })
   )*
   aaaa
   (?{ $res = $cnt })               # On success copy to non-localized
                                    # location.
>x;

will set $res = 4. Note that after the match $cnt returns to the globally introduced value 0, since the scopes which restrict local statements are unwound.
The result of evaluation of the code is put into the special variable $^R. This happens immediately, so $^R can be used from other (?{ code }) assertions inside the same regular expression.
For reasons of security, this construct is forbidden if the regular expression involves run-time interpolation of variables, unless the use re 'eval' pragma is used (see the re manpage), or the variables contain results of the qr() operator (see qr/STRING/imosx). Consider the typical run-time pattern idiom:

$re = <>;
chomp $re;
$string =~ /$re/;

Since scripts like this predate the time when (?{}) was introduced, it was considered bad to add new security holes to existing scripts. Note that use re 'eval' does not disable tainting checks, so to allow $re in the above snippet to contain (?{}) with tainting enabled, one needs both use re 'eval' and to untaint the $re.
(?>pattern)  An ``independent'' subexpression: it matches the substring that a standalone pattern would match if anchored at the given position, and only this substring. For example, ^(?>a*)ab will never match, since (?>a*) (anchored at the beginning of the string, as above) will match all characters a at the beginning of the string, leaving no a for ab to match. In contrast, a*ab will match the same as a+b, since the match of the subgroup a* is influenced by the following group ab (see Backtracking). In particular, a* inside a*ab will match fewer characters than a standalone a*, since this makes the tail match.

An effect similar to (?>pattern) may be achieved by (?=(pattern))\1. This matches the same substring as a standalone a+, and the following \1 eats the matched string, thus making a zero-length assertion into an analogue of (?>...). (The difference between these two constructs is that the second one uses a capturing group, thus shifting ordinals of backreferences in the rest of the regular expression.)
(?>pattern) is also useful in preventing pathological backtracking. Consider a simplistic pattern to match a balanced group of parentheses:

m{ \( ( [^()]+ | \( [^()]* \) )+ \) }x

If there is no such group in the string, matching can take essentially forever: this is what (.+)+ is doing, and (.+)+ is similar to a subpattern of the above pattern. Consider that the above pattern detects no-match on ((()aaaaaaaaaaaaaaaaaa in several seconds, but that each extra letter doubles this time. This exponential performance will make it appear that your program has hung. The pattern rewritten with an independent group,

m{ \( ( (?> [^()]+ ) | \( [^()]* \) )+ \) }x

matches exactly when the one above does (verifying this yourself would be a productive exercise), but finishes in a fourth the time when used on a similar string with 1000000 a's. (Be aware, however, that this pattern currently triggers a warning message under -w saying it ``matches the null string many times''.) On simple groups, such as the pattern (?> [^()]+ ), a comparable effect may be achieved by negative lookahead, as in [^()]+ (?! [^()] ). This was only 4 times slower on a string with 1000000 a's.
(?(condition)yes-pattern|no-pattern)  Conditional expression. The (condition) should be either an integer in parentheses (which is valid if the corresponding pair of parentheses matched), or a lookahead/lookbehind/evaluate zero-width assertion. For example,

m{ ( \( )? [^()]+ (?(1) \) ) }x

matches a chunk of non-parentheses, possibly included in parentheses themselves.
(?imsx-imsx)  One or more embedded pattern-match modifiers. This is particularly useful for patterns that are specified in a table somewhere, some of which want to be case sensitive and some of which don't. The case-insensitive ones need merely include (?i) at the front of the pattern. For example:

$pattern = "foobar";
if ( /$pattern/i ) { }

# more flexible:

$pattern = "(?i)foobar";
if ( /$pattern/ ) { }

Letters after the - switch modifiers off. These modifiers are localized inside an enclosing group (if any). For example,

( (?i) blah ) \s+ \1

(assuming the x modifier, and no i modifier outside of this group) will match a repeated (including the case!) word blah in any case.
A question mark was chosen for this and for the new minimal-matching construct because 1) question mark is pretty rare in older regular expressions, and 2) whenever you see one, you should stop and ``question'' exactly what is going on. That's psychology...
A fundamental feature of regular expression matching involves the notion called backtracking, which is currently used (when needed) by all regular expression quantifiers, namely *, *?, +, +?, {n,m}, and {n,m}?.
For a regular expression to match, the entire regular expression must match, not just part of it. So if the beginning of a pattern containing a quantifier succeeds in a way that causes later parts in the pattern to fail, the matching engine backs up and recalculates the beginning part--that's why it's called backtracking.
Here is an example of backtracking: Let's say you want to find the word following ``foo'' in the string ``Food is on the foo table.'':
$_ = "Food is on the foo table."; if ( /\b(foo)\s+(\w+)/i ) { print "$2 follows $1.\n"; }
When the match runs, the first part of the regular expression (\b(foo)) finds a possible match right at the beginning of the string, and loads up $1 with ``Foo''. However, as soon as the matching engine sees that there's no whitespace following the ``Foo'' that it had saved in $1, it realizes its mistake and starts over again one character after where it had the tentative match. This time it goes all the way until the next occurrence of ``foo''. The complete regular expression matches this time, and you get the expected output of ``table follows foo.''
Sometimes minimal matching can help a lot. Imagine you'd like to match everything between ``foo'' and ``bar''. Initially, you write something like this:
$_ = "The food is under the bar in the barn."; if ( /foo(.*)bar/ ) { print "got <$1>\n"; }
Which perhaps unexpectedly yields:
got <d is under the bar in the >
That's because .* was greedy, so you get everything between the first ``foo'' and the last ``bar''. In this case, it's more effective to use minimal matching to make sure you get the text between a ``foo'' and the first ``bar'' thereafter.
if ( /foo(.*?)bar/ ) { print "got <$1>\n" }

got <d is under the >
Here's another example: let's say you'd like to match a number at the end of a string, and you also want to keep the preceding part of the match. So you write this:
$_ = "I have 2 numbers: 53147"; if ( /(.*)(\d*)/ ) { # Wrong! print "Beginning is <$1>, number is <$2>.\n"; }
That won't work at all, because .* was greedy and gobbled up the whole string. As \d* can match on an empty string, the complete regular expression matched successfully.
Beginning is <I have 2 numbers: 53147>, number is <>.
Here are some variants, most of which don't work:
$_ = "I have 2 numbers: 53147"; @pats = qw{ (.*)(\d*) (.*)(\d+) (.*?)(\d*) (.*?)(\d+) (.*)(\d+)$ (.*?)(\d+)$ (.*)\b(\d+)$ (.*\D)(\d+)$ };
for $pat (@pats) { printf "%-12s ", $pat; if ( /$pat/ ) { print "<$1> <$2>\n"; } else { print "FAIL\n"; } }
That will print out:
(.*)(\d*)    <I have 2 numbers: 53147> <>
(.*)(\d+)    <I have 2 numbers: 5314> <7>
(.*?)(\d*)   <> <>
(.*?)(\d+)   <I have > <2>
(.*)(\d+)$   <I have 2 numbers: 5314> <7>
(.*?)(\d+)$  <I have 2 numbers: > <53147>
(.*)\b(\d+)$ <I have 2 numbers: > <53147>
(.*\D)(\d+)$ <I have 2 numbers: > <53147>
As you see, this can be a bit tricky. It's important to realize that a regular expression is merely a set of assertions that gives a definition of success. There may be 0, 1, or several different ways that the definition might succeed against a particular string. And if there are multiple ways it might succeed, you need to understand backtracking to know which variety of success you will achieve.
When using lookahead assertions and negations, this can all get even trickier. Imagine you'd like to find a sequence of non-digits not followed by ``123''. You might try to write that as
$_ = "ABC123"; if ( /^\D*(?!123)/ ) { # Wrong! print "Yup, no 123 in $_\n"; }
But that isn't going to match; at least, not the way you're hoping. It claims that there is no 123 in the string. Here's a clearer picture of why that pattern matches, contrary to popular expectations:
$x = 'ABC123';
$y = 'ABC445';

print "1: got $1\n" if $x =~ /^(ABC)(?!123)/ ;
print "2: got $1\n" if $y =~ /^(ABC)(?!123)/ ;

print "3: got $1\n" if $x =~ /^(\D*)(?!123)/ ;
print "4: got $1\n" if $y =~ /^(\D*)(?!123)/ ;
This prints
2: got ABC
3: got AB
4: got ABC
You might have expected test 3 to fail because it seems to be a more general purpose version of test 1. The important difference between them is that test 3 contains a quantifier (\D*) and so can use backtracking, whereas test 1 will not. What's happening is that you've asked, ``Is it true that at the start of $x, following 0 or more non-digits, you have something that's not 123?'' If the pattern matcher had let \D* expand to ``ABC'', this would have caused the whole pattern to fail. The search engine will initially match \D* with ``ABC''. Then it will try to match (?!123) with ``123'', which of course fails. But because a quantifier (\D*) has been used in the regular expression, the search engine can backtrack and retry the match differently in the hope of matching the complete regular expression.
The pattern really, really wants to succeed, so it uses the standard pattern back-off-and-retry and lets \D* expand to just ``AB'' this time. Now there's indeed something following ``AB'' that is not ``123''. It's in fact ``C123'', which suffices.
We can deal with this by using both an assertion and a negation. We'll say that the first part in $1 must be followed by a digit, and in fact, it must also be followed by something that's not ``123''. Remember that the lookaheads are zero-width expressions: they only look, but don't consume any of the string in their match. So rewriting this way produces what you'd expect; that is, case 5 will fail, but case 6 succeeds:

print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/ ;
print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/ ;
6: got ABC
In other words, the two zero-width assertions next to each other work as though they're ANDed together, just as you'd use any builtin assertions: /^$/ matches only if you're at the beginning of the line AND the end of the line simultaneously. The deeper underlying truth is that juxtaposition in regular expressions always means AND, except when you write an explicit OR using the vertical bar. /ab/ means match ``a'' AND (then) match ``b'', although the attempted matches are made at different positions because ``a'' is not a zero-width assertion, but a one-width assertion.
One warning: particularly complicated regular expressions can take exponential time to solve, because of the immense number of possible ways they can use backtracking to try to match. For example, this will take a very long time to run:
/((a{0,5}){0,5}){0,5}/
And if you used *'s instead of limiting it to 0 through 5 matches, then it would take literally forever--or until you ran out of stack space. A powerful tool for optimizing such beasts is ``independent'' groups, which do not backtrack (see (?>pattern)). Note also that zero-length lookahead/lookbehind assertions will not backtrack to make the tail match, since they are in ``logical'' context: only the fact of whether they match or not is considered relevant. For an example where side-effects of a lookahead might have influenced the following match, see (?>pattern).
In case you're not familiar with the ``regular'' Version 8 regex routines, here are the pattern-matching rules not described above.
Any single character matches itself, unless it is a metacharacter with a special meaning described here or above. You can cause characters that normally function as metacharacters to be interpreted literally by prefixing them with a ``\'' (e.g., ``\.'' matches a ``.'', not any character; ``\\'' matches a ``\''). A series of characters matches that series of characters in the target string, so the pattern blurfl would match ``blurfl'' in the target string.
You can specify a character class by enclosing a list of characters in [], which will match any one character from the list. If the first character after the ``['' is ``^'', the class matches any character not in the list. Within a list, the ``-'' character is used to specify a range, so that a-z represents all characters between ``a'' and ``z'', inclusive. If you want ``-'' itself to be a member of a class, put it at the start or end of the list, or escape it with a backslash. (The following all specify the same class of three characters: [-az], [az-], and [a\-z]. All are different from [a-z], which specifies a class containing twenty-six characters.)
Characters may be specified using a metacharacter syntax much like that used in C: ``\n'' matches a newline, ``\t'' a tab, ``\r'' a carriage return, ``\f'' a form feed, etc. More generally, \nnn, where nnn is a string of octal digits, matches the character whose ASCII value is nnn. Similarly, \xnn, where nn are hexadecimal digits, matches the character whose ASCII value is nn. The expression \cx matches the ASCII character control-x. Finally, the ``.'' metacharacter matches any character except ``\n'' (unless you use /s).
You can specify a series of alternatives for a pattern using ``|'' to separate them, so that fee|fie|foe will match any of ``fee'', ``fie'', or ``foe'' in the target string (as would f(e|i|o)e). The first alternative includes everything from the last pattern delimiter (``('', ``['', or the beginning of the pattern) up to the first ``|'', and the last alternative contains everything from the last ``|'' to the next pattern delimiter. For this reason, it's common practice to include alternatives in parentheses, to minimize confusion about where they start and end.
Alternatives are tried from left to right, so the first alternative found for which the entire expression matches is the one that is chosen. This means that alternatives are not necessarily greedy. For example: when matching foo|foot against ``barefoot'', only the ``foo'' part will match, as that is the first alternative tried, and it successfully matches the target string. (This might not seem important, but it is important when you are capturing matched text using parentheses.)
Also remember that ``|'' is interpreted as a literal within square brackets, so if you write [fee|fie|foe] you're really only matching [feio|].
Within a pattern, you may designate subpatterns for later reference by enclosing them in parentheses, and you may refer back to the nth subpattern later in the pattern using the metacharacter \n. Subpatterns are numbered based on the left-to-right order of their opening parenthesis. A backreference matches whatever actually matched the subpattern in the string being examined, not the rules for that subpattern. Therefore, (0|0x)\d*\s\1\d* will match ``0x1234 0x4321'', but not ``0x1234 01234'', because subpattern 1 actually matched ``0x'', even though the rule 0|0x could potentially match the leading 0 in the second number.
Some people get too used to writing things like:
$pattern =~ s/(\W)/\\\1/g;
This is grandfathered for the RHS of a substitute to avoid shocking the sed addicts, but it's a dirty habit to get into. That's because in PerlThink, the righthand side of a s/// is a double-quoted string. \1 in the usual double-quoted string means a control-A. The customary Unix meaning of \1 is kludged in for s///. However, if you get into the habit of doing that, you get yourself into trouble if you then add an /e modifier.
s/(\d+)/ \1 + 1 /eg; # causes warning under -w
Or if you try to do
s/(\d+)/\1000/;
You can't disambiguate that by saying \{1}000, whereas you can fix it with ${1}000. Basically, the operation of interpolation should not be confused with the operation of matching a backreference. Certainly they mean two different things on the left side of the s///.
WARNING: Difficult material (and prose) ahead. This section needs a rewrite.
Regular expressions provide a terse and powerful programming language. As with most other power tools, power comes together with the ability to wreak havoc.
A common abuse of this power stems from the ability to make infinite loops using regular expressions, with something as innocuous as:
'foo' =~ m{ ( o? )* }x;
The o? can match at the beginning of 'foo', and since the position in the string is not moved by the match, o? would match again and again because of the * modifier. Another common way to create a similar cycle is with the looping modifier //g:
@matches = ( 'foo' =~ m{ o? }xg );
or
print "match: <$&>\n" while 'foo' =~ m{ o? }xg;
or the loop implied by split().
However, long experience has shown that many programming tasks may be significantly simplified by using repeated subexpressions which may match zero-length substrings, with a simple example being:
@chars = split //, $string;           # // is not magic in split
($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
Thus Perl allows the /()/ construct, which forcefully breaks the infinite loop. The rules for this are different for the lower-level loops given by the greedy modifiers *+{}, and for the higher-level ones like the /g modifier or split() operator.
The lower-level loops are interrupted when it is detected that a repeated expression did match a zero-length substring, thus
m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;
is made equivalent to
m{ (?: NON_ZERO_LENGTH )* | (?: ZERO_LENGTH )? }x;
The higher-level loops preserve an additional state between iterations: whether the last match was zero-length. To break the loop, the following match after a zero-length match is prohibited from having a length of zero. This prohibition interacts with backtracking (see Backtracking), and so the second best match is chosen if the best match is of zero length.
Say,
$_ = 'bar'; s/\w??/<$&>/g;
results in
"<
<b><><a><><r><>``>.
At each position of the string the best match given by non-greedy
??
is the zero-length match, and the second best match
is what is matched by \w
. Thus zero-length matches alternate
with one-character-long matches.
Similarly, for repeated m/()/g the second-best match is the match at the position one notch further in the string.
The additional state of being matched with zero-length is associated with the matched string, and is reset by each assignment to pos().
Overloaded constants (see the overload manpage) provide a simple way to extend the functionality of the RE engine.
Suppose that we want to enable a new RE escape-sequence \Y| which matches at a boundary between whitespace characters and non-whitespace characters. Note that (?=\S)(?<!\S)|(?!\S)(?<=\S) matches exactly at these positions, so we want to have each \Y| in the place of the more complicated version. We can create a module customre to do this:
package customre;
use overload;

sub import {
    shift;
    die "No argument to customre::import allowed" if @_;
    overload::constant 'qr' => \&convert;
}

sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'" }

my %rules = ( '\\' => '\\',
              'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );

sub convert {
    my $re = shift;
    $re =~ s{ \\ ( \\ | Y . ) }
            { $rules{$1} or invalid($re, $1) }sgex;
    return $re;
}
Now use customre enables the new escape in constant regular expressions, i.e., those without any runtime variable interpolations. As documented in the overload manpage, this conversion will work only over literal parts of regular expressions. For \Y|$re\Y| the variable part of this regular expression needs to be converted explicitly (but only if the special meaning of \Y| should be enabled inside $re):
use customre;
$re = <>;
chomp $re;
$re = customre::convert $re;
/\Y|$re\Y|/;