Legacy systems written in COBOL and PL/I continue to underpin critical business processes in sectors such as banking, insurance and government. However, extracting meaningful business requirements from this code is labor-intensive, error-prone and dependent on domain expertise. To address this, we present the AI/Run Platform, which uses OpenAI's large language models (LLMs) to automate and optimize the process. By parsing COBOL and PL/I code, the platform generates comprehensive business requirements documentation, including business rules, lineage analysis, CRUD matrices and source-to-target data mappings. This paper discusses the architecture, workflows and evaluation metrics of the platform, showcasing its effectiveness in modernizing legacy systems and accelerating digital transformation efforts.
Introduction
COBOL and PL/I have long been at the core of enterprise software, powering mission-critical systems for decades. As demands for agility and modernization increase, organizations face the dual challenge of limited COBOL and PL/I expertise and the need to efficiently extract the valuable business knowledge hidden within legacy code. Harnessing generative AI (GenAI) for these tasks not only accelerates modernization efforts but also preserves business logic and improves collaboration between technical and business stakeholders.
Background and Motivation
COBOL, developed in the late 1950s, still powers many mission-critical systems. However, with declining COBOL and PL/I expertise and the increasing demand for software modernization, there is a need to extract business logic, rules and data lineage from COBOL and PL/I code into comprehensible and reusable formats. The extracted requirements can then be used to generate new code based on modern architecture and design, which shortens time to market, makes changes easier and allows deployment to hybrid or cloud infrastructure. Manual methods fall short because of the complex structure of COBOL and PL/I programs, the lack of proper documentation and the time required for comprehensive extraction.
Programming Language One (PL/I) was developed by IBM in the early 1960s. It was designed to combine the features of scientific, engineering and business programming languages, aiming to be a universal language suitable for a wide range of applications. PL/I incorporated features from FORTRAN, COBOL and ALGOL, making it versatile and powerful.
One of the biggest challenges in modernization is understanding the current programs well enough for product owners or business analysts to provide the business rules and requirements of the existing systems. There is always a chance that some requirements, validations and rules are missed because the corresponding code has not been executed in a long time.
AI tools, particularly those based on LLMs such as OpenAI's models, have demonstrated prowess in understanding natural language and structured formats. Applying LLMs to extract business-level specifications from COBOL and PL/I code is a promising way to bridge the gap between legacy systems and modern software engineering practices.
Averaging the key metrics across the tools currently on the market gives the following results:
- Accuracy of Analysis: Medium
- Effort Saving: 40-50%
- User Feedback: Easy to use, but requires fine-tuning to achieve better accuracy
Contributions
This paper introduces a novel AI-powered platform that:
- Automatically parses COBOL and PL/I code and generates business requirement documents (BRDs).
- Extracts and outlines business rules embedded within the code.
- Performs lineage analysis for data and system dependencies.
- Performs deep analysis for job control languages (JCLs).
- Constructs CRUD matrices and source-to-target data mappings to provide insights into the system's structure.
- Reduces the time and complexity of translating legacy code into actionable business requirements.
Related Work
While tools exist for COBOL and PL/I parsing and code analysis, few extend beyond static code analysis to deliver meaningful business-level insights. Relevant prior work includes:
- Static Code Analysis Tools: Traditional COBOL and PL/I parsers extract procedural dependencies and data models, but fail to align these artifacts with business contexts.
- Reverse Engineering Research: Studies focus on reverse engineering of legacy systems. However, these approaches remain resource-intensive and heavily manual.
- Applications of LLMs in Software Engineering: LLMs have been successfully used for code autocompletion and natural language explanations, but their application to COBOL and PL/I and business specification extraction is nascent.
Our work fills this gap by using advanced natural language understanding capabilities of OpenAI's LLM to extract both structural and business-level insights from COBOL and PL/I code.
Methodology
Our approach combines advanced natural language processing (NLP) with robust static code analysis to transform complex COBOL and PL/I codebases into actionable business requirements. By integrating AI-driven interpreters with flexible parsing and interactive interfaces, the platform streamlines knowledge extraction while supporting validation, customization and collaboration at each step, making it practical for large-scale modernization of business data processing systems.
Architecture Overview
The platform comprises the following high-level components:
- Interfaces: A GUI lets the user upload the COBOL and PL/I source code and its dependencies, either as raw files zipped together or as compiler listings. The interface also contains an interactive chatbot that answers user questions once the COBOL and PL/I code has been processed.
- Source Code Processors: Extract syntax tree representations, control flow and data flow from COBOL and PL/I programs. They consist of ANother Tool for Language Recognition (ANTLR) and an abstract syntax tree (AST) builder, which capture the overall structure and syntax for further processing.
- Data Ingress Processors: Validate and enrich the data and store it in a SQL database before it is processed by the LLM.
- App Services: This layer provides services for further processing and for interacting with the LLM through API or batch calls.
- LLM Interpreter (OpenAI): Processes COBOL and PL/I code to generate plain-language descriptions of business logic, rules and system operations.
- Artifact Generators:
  - BRD Generator: Summarizes extracted information on the UI with defined sections of a BRD.
  - Business Rules Extractor: Identifies operational constraints, conditions and policies.
  - Lineage Analyzer: Tracks dependencies and relationships between program components.
  - CRUD Matrix and Source-to-Target Mapping Generator: Highlights create-read-update-delete operations and data transformations in a graph UI (Neo4j) that represents the flow.
Workflow
- Input: COBOL and PL/I source code is fed into the system.
- Parsing: The COBOL and PL/I parser identifies program modules, data definitions and procedural flows.
- LLM Analysis: The parsed code is processed by the LLM, which translates technical syntax into human-readable requirements and rules.
- Artifact Generation: The resulting information is synthesized into structured documents.
- Output: Provides detailed BRDs, lineage analysis, CRUD matrices and source-to-target mappings.
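The workflow steps above can be sketched as a small staged pipeline. This is a minimal illustration under stated assumptions, not the platform's implementation: the names `ParsedProgram`, `parse`, `analyze`, `generate_brd` and `run_pipeline` are hypothetical, the parsing stage is a toy that relies on fixed-format COBOL columns instead of an ANTLR grammar, and the LLM call is replaced by a caller-supplied function so the sketch runs offline.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class ParsedProgram:
    program_id: str
    paragraphs: dict[str, list[str]]  # paragraph name -> statement lines


def parse(source: str) -> ParsedProgram:
    """Toy parsing stage: find PROGRAM-ID and split the PROCEDURE DIVISION
    into paragraphs using fixed-format columns (Area A = columns 8-11)."""
    text = source.upper()  # simplification: this also upper-cases literals
    match = re.search(r"PROGRAM-ID\.\s+([A-Z0-9-]+)", text)
    program_id = match.group(1) if match else "UNKNOWN"
    _, _, procedure = text.partition("PROCEDURE DIVISION.")
    paragraphs: dict[str, list[str]] = {}
    current = None
    for line in procedure.splitlines():
        if len(line) > 6 and line[6] in "*/":
            continue  # comment line (indicator in column 7)
        area_a, area_b = line[7:11], line[11:72]
        if area_a.strip():
            # Text starting in Area A is a section or paragraph header.
            current = line[7:72].strip().rstrip(".")
            paragraphs[current] = []
        elif current is not None and area_b.strip():
            paragraphs[current].append(area_b.strip())
    return ParsedProgram(program_id, paragraphs)


def analyze(parsed: ParsedProgram, interpret: Callable[[str], str]) -> dict[str, str]:
    """LLM analysis stage: `interpret` stands in for the model call."""
    return {name: interpret("\n".join(body)) for name, body in parsed.paragraphs.items()}


def generate_brd(program_id: str, descriptions: dict[str, str]) -> str:
    """Artifact generation stage: assemble a plain-text requirements summary."""
    lines = [f"Business requirements for {program_id}"]
    lines += [f"- {name}: {text}" for name, text in descriptions.items()]
    return "\n".join(lines)


def run_pipeline(source: str, interpret: Callable[[str], str]) -> str:
    parsed = parse(source)
    return generate_brd(parsed.program_id, analyze(parsed, interpret))


def stub_interpreter(statements: str) -> str:
    # Placeholder for the LLM: it only counts the lines it was given.
    return f"describes {len(statements.splitlines())} statement line(s)"


A = " " * 7   # Area A starts in column 8
B = " " * 11  # Area B starts in column 12
SAMPLE = "\n".join([
    A + "IDENTIFICATION DIVISION.",
    A + "PROGRAM-ID. CLAIMS01.",
    A + "PROCEDURE DIVISION.",
    A + "VALIDATE-CLAIM.",
    B + "IF CLAIM-AMOUNT > 10000",
    B + "    MOVE 'REVIEW' TO CLAIM-STATUS",
    B + "END-IF.",
    A + "CLOSE-CLAIM.",
    B + "MOVE 'CLOSED' TO CLAIM-STATUS.",
])

report = run_pipeline(SAMPLE, stub_interpreter)
print(report)
```

Passing the interpreter in as a function keeps the parsing and artifact stages testable without a model, which is one plausible way to validate each step in isolation.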
Example Use Case
Consider a COBOL or PL/I program that manages insurance claim processing. The platform extracts and documents:
- Policies governing claim approvals (business rules).
- Data lineage tracking how input data (e.g., claim details) transforms into outcomes.
- CRUD operations linking claims to persistent database operations.
- Source-to-target mappings showing field-level dependencies.
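To make the CRUD item above concrete, the following sketch derives a program-by-table CRUD matrix from embedded SQL. It is an illustration only and assumes the programs use `EXEC SQL ... END-EXEC` blocks; the function name `crud_matrix`, the sample programs and the table names are hypothetical, and plain regular expressions stand in for a real SQL parser (for example, a cursor's `FOR UPDATE OF` clause would be misread as an update).

```python
import re
from collections import defaultdict

# Embedded SQL in COBOL is delimited by EXEC SQL ... END-EXEC.
SQL_BLOCK = re.compile(r"EXEC\s+SQL(.*?)END-EXEC", re.IGNORECASE | re.DOTALL)

# One simplified pattern per CRUD operation; each captures the table name.
PATTERNS = {
    "C": re.compile(r"\bINSERT\s+INTO\s+([A-Z0-9_.]+)", re.IGNORECASE),
    "R": re.compile(r"\bSELECT\b.*?\bFROM\s+([A-Z0-9_.]+)", re.IGNORECASE | re.DOTALL),
    "U": re.compile(r"\bUPDATE\s+([A-Z0-9_.]+)", re.IGNORECASE),
    "D": re.compile(r"\bDELETE\s+FROM\s+([A-Z0-9_.]+)", re.IGNORECASE),
}


def crud_matrix(programs: dict[str, str]) -> dict[tuple[str, str], str]:
    """Map (program, table) to the CRUD letters found in its embedded SQL."""
    cells: dict[tuple[str, str], set[str]] = defaultdict(set)
    for program, source in programs.items():
        for block in SQL_BLOCK.findall(source):
            for operation, pattern in PATTERNS.items():
                for table in pattern.findall(block):
                    cells[(program, table.upper())].add(operation)
    # Render each cell in canonical C-R-U-D order.
    return {
        key: "".join(op for op in "CRUD" if op in ops)
        for key, ops in cells.items()
    }


# Hypothetical claim-processing programs.
PROGRAMS = {
    "CLAIMS01": (
        "EXEC SQL SELECT STATUS INTO :WS-STATUS FROM CLAIM "
        "WHERE CLAIM_ID = :WS-ID END-EXEC. "
        "EXEC SQL UPDATE CLAIM SET STATUS = 'APPROVED' "
        "WHERE CLAIM_ID = :WS-ID END-EXEC."
    ),
    "CLAIMS02": (
        "EXEC SQL INSERT INTO CLAIM_AUDIT (CLAIM_ID) VALUES (:WS-ID) END-EXEC. "
        "EXEC SQL DELETE FROM CLAIM WHERE STATUS = 'VOID' END-EXEC."
    ),
}

matrix = crud_matrix(PROGRAMS)
for (program, table), operations in sorted(matrix.items()):
    print(f"{program:<10}{table:<14}{operations}")
```

Keyed cells like these could be loaded into a graph store as program-to-table edges, with the CRUD letters as an edge property.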
Evaluation
To assess the real-world value of the GenAI-powered extraction process, we applied the solution to diverse COBOL-based systems and PL/I applications in production environments. Evaluation focused on accuracy, coverage, time savings and user feedback from business analysts and COBOL programmers, benchmarking performance against traditional manual review and static code analysis tools.
Sample Use Case
In a manufacturing industry use case with a total of 39M LoC of PL/I, the platform was used to reverse engineer a program. The goal was to produce a BRD that the product owner could take up for forward engineering and product development on an open-source platform. The results were encouraging: the platform improved the efficiency of manual reverse engineering by 85%.
Metrics
The platform is evaluated based on:
- Accuracy: How well the extracted artifacts match manually generated ones.
- Coverage: Completeness of the business logic and rules extraction.
- Processing Time: Time taken for each task compared to manual methods.
- User Feedback: Ease of understanding the generated BRDs and relevance of insights.
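The paper does not give formulas for these metrics, so the sketch below shows one plausible way to compute the quantitative ones. It assumes that extracted and manually documented rules can be reduced to comparable identifiers, treats accuracy as the share of extracted rules confirmed by the manual baseline, and treats coverage as the share of manual rules that were found. The function names and the sample numbers are hypothetical and are not results from the evaluation.

```python
def rule_metrics(extracted: set[str], manual: set[str]) -> dict[str, float]:
    """Compare extracted rule identifiers against a manual baseline."""
    matched = extracted & manual
    return {
        # Share of extracted rules that the manual baseline confirms.
        "accuracy": len(matched) / len(extracted) if extracted else 0.0,
        # Share of manually documented rules that the extraction found.
        "coverage": len(matched) / len(manual) if manual else 0.0,
    }


def time_saving(manual_hours: float, platform_hours: float) -> float:
    """Fraction of manual effort saved by the platform for the same task."""
    return 1.0 - platform_hours / manual_hours


# Made-up example data, for illustration only.
manual_rules = {"R1", "R2", "R3", "R4"}
extracted_rules = {"R1", "R2", "R3", "R9"}

metrics = rule_metrics(extracted_rules, manual_rules)
saving = time_saving(manual_hours=40.0, platform_hours=10.0)
print(metrics, saving)
```

In this made-up example three of the four extracted rules match the baseline and three of the four baseline rules are found, so both ratios are 0.75.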
Results
The platform achieved:
- 85% accuracy in identifying and documenting business rules.
- Significant time savings, reducing documentation efforts by 70%.
- Positive feedback from business analysts, emphasizing the readability of generated artifacts.
Discussion
The results demonstrate that GenAI-driven platforms not only speed up requirement extraction from complex COBOL divisions (including the data division, the procedure division and the working-storage section), but also deliver higher accuracy and better alignment with business objectives. While challenges remain for edge cases and highly customized code, the potential for sustainable, large-scale COBOL modernization is clear.
Benefits and Applications
- Accelerated Digital Transformation: The platform helps in migrating COBOL and PL/I-based business logic to modern systems/environments.
- Knowledge Preservation: Extracted documentation serves as a resource for training and historical reference.
- Improved Collaboration: Business stakeholders gain better visibility into legacy systems, fostering collaboration with IT teams.
Challenges
- Managing large and convoluted COBOL and PL/I programs.
- Ensuring generalizability to programs written in languages other than COBOL and PL/I (or hybrid programs) and to undocumented codebases.
- Enhancing the interpretability of outputs from LLMs for domain-specific use cases.
Costs
- License: License costs are usually charged per user and vary from tool to IP. Most organizations keep the costs to a minimum to ensure the product's maturity.
- LLM Tokens: As AI service providers compete by offering lower rates for token usage, tokens are likely to become a commodity sooner or later. The cost of LLM tokens is already minimal. For example, processing 3.2 million lines of code (LoC) across 100 runs on the platform costs approximately $4,000.
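The token cost example can be broken down into unit costs. The sketch below assumes one reading of the figures, namely that the same 3.2 million LoC were processed in each of the 100 runs and that $4,000 is the total for all runs; the variable names are illustrative. Under a different reading (3.2 million LoC in total, spread over the runs) the per-LoC figure would differ.

```python
# Figures quoted in the text.
TOTAL_COST_USD = 4_000
RUNS = 100
LINES_OF_CODE = 3_200_000

# Assumption: every run processes the full 3.2M LoC codebase.
cost_per_run = TOTAL_COST_USD / RUNS
cost_per_million_loc_per_run = cost_per_run * 1_000_000 / LINES_OF_CODE

print(f"Cost per run: ${cost_per_run:.2f}")
print(f"Cost per million LoC per run: ${cost_per_million_loc_per_run:.2f}")
```

Under this assumption the quoted total corresponds to about $40 per run, or $12.50 per million lines of code per run.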
Future Work
Future improvements may include:
- Extending support to more programming languages.
- Incorporating feedback loops to iteratively refine outputs.
- Exploring fine-tuning techniques to customize LLMs for COBOL and PL/I-specific nuances.
Conclusion
We presented an AI-powered approach for extracting business requirements from COBOL and PL/I programs using OpenAI LLMs. The platform automates the generation of BRDs, business rules, lineage analysis, CRUD matrices and source-to-target mappings. By addressing challenges associated with legacy systems, the platform accelerates legacy modernization efforts and enhances understanding of critical business processes. Future work aims to refine accuracy, broaden support and integrate domain-specific customizations.
Converting procedural legacy code directly into modern, object-oriented open-source code is not very helpful, because one-to-one conversion does not parse the code properly and carries anomalies over from one language to the other. In addition, the design of open systems is domain-driven, and microservices architectures help with scaling, which is not possible with one-to-one code conversion. The code-to-specs parser helps solve this by building the application from the ground up, starting from the requirements extracted from the existing code, and by putting the right focus on design and scalability before any code is written in an open-source language. This strategy also saves time and effort on refactoring and code hardening in the later stages of migration, reducing costs and time to market.
FAQs
How secure is it to upload COBOL programs or PL/I source code to an AI-powered platform for analysis?
Leading GenAI solutions for COBOL applications and business data processing support encrypted file transfers, access control and deployment within secure, on-premises environments. Always check if the platform complies with industry security protocols and offers safeguards for critical business data, especially in regulated sectors like banking systems and transaction processing.
Can generative AI tools understand customized or complex COBOL divisions, such as the data division, procedure division and working storage section?
GenAI tools handle standard COBOL syntax well, including common data division patterns, file sections and procedure logic. However, legacy systems often contain custom COPYBOOKs, compiler-specific extensions, REDEFINES overlays or vendor macros that general models do not reliably interpret. This is why the most accurate platforms rely on custom lexical parsers and AST generation before feeding normalized code into the LLM.
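As an illustration of the kind of normalization that can happen before code reaches an LLM, the sketch below cleans fixed-format COBOL source: it drops the sequence number area (columns 1-6), skips comment lines (an asterisk or slash in column 7) and discards the identification area (columns 73-80). It is a simplified example, not the platform's parser; the names `normalize_fixed_format` and `card` are hypothetical, and continuation lines, debugging lines and COPY expansion are deliberately not handled.

```python
def normalize_fixed_format(source: str) -> str:
    """Keep only the code area (columns 8-72) of fixed-format COBOL lines."""
    kept = []
    for raw in source.splitlines():
        line = raw.ljust(7)    # make sure column 7 exists on short lines
        indicator = line[6]    # column 7 is the indicator area
        if indicator in "*/":
            continue           # comment line
        code = line[7:72].rstrip()
        if code:
            kept.append(code)
    return "\n".join(kept)


def card(sequence: str, indicator: str, code: str, ident: str = "") -> str:
    """Build one 80-column source line for the example below."""
    return f"{sequence:<6}{indicator}{code:<65}{ident}"


SOURCE = "\n".join([
    card("000100", " ", "IDENTIFICATION DIVISION.", "CLM00010"),
    card("000200", "*", "LEGACY COMMENT ABOUT CLAIM LIMITS", "CLM00020"),
    card("000300", " ", "PROGRAM-ID. CLAIMS01.", "CLM00030"),
    card("000400", " ", "    MOVE ZERO TO WS-TOTAL."),
])

normalized = normalize_fixed_format(SOURCE)
print(normalized)
```

Removing sequence numbers and identification text keeps layout residue out of the prompt, so the model sees only the statements themselves.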
What technical background do business analysts or COBOL programmers need to work with GenAI-based business requirement extraction tools?
Business analysts do not need deep COBOL expertise, but understanding the main COBOL divisions, basic PERFORM/EVALUATE logic and common data structures helps them validate extracted business rules. Modern GenAI tools present the logic in English, so analysts can focus on domain meaning while engineers handle syntax-specific validation.
How can extracted requirements from COBOL programs be used to accelerate the modernization of COBOL systems?
Extracted business rules, record flows and CRUD maps accelerate modernization by turning implicit logic buried in COBOL into explicit, verifiable artifacts. Instead of performing line-by-line porting, teams can design modular, domain-driven architectures informed by the actual business behavior of the legacy system. This reduces misinterpretation risks and supports cleaner re-platforming strategies.
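One way to picture how implicit record flows become explicit artifacts is a field-level source-to-target map built from `MOVE` statements. The sketch below is illustrative only: the names `direct_mappings` and `upstream_fields` and the sample fields are hypothetical, only simple single-target `MOVE a TO b` statements are recognized, and data flow through `COMPUTE`, arithmetic verbs or group moves is ignored.

```python
import re
from collections import defaultdict

# Matches simple "MOVE source TO target" statements with one target.
MOVE = re.compile(r"\bMOVE\s+([A-Z][A-Z0-9-]*)\s+TO\s+([A-Z][A-Z0-9-]*)", re.IGNORECASE)

# Figurative constants are values, not source fields.
FIGURATIVE = {"ZERO", "ZEROS", "ZEROES", "SPACE", "SPACES", "HIGH-VALUES", "LOW-VALUES"}


def direct_mappings(source: str) -> list[tuple[str, str]]:
    """Return (source_field, target_field) pairs for each simple MOVE."""
    return [
        (src.upper(), tgt.upper())
        for src, tgt in MOVE.findall(source)
        if src.upper() not in FIGURATIVE
    ]


def upstream_fields(mappings: list[tuple[str, str]], target: str) -> set[str]:
    """Collect every field that feeds `target`, directly or transitively."""
    parents: dict[str, set[str]] = defaultdict(set)
    for src, tgt in mappings:
        parents[tgt].add(src)
    seen: set[str] = set()
    stack = [target]
    while stack:
        for src in parents[stack.pop()]:
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen


SNIPPET = """
    MOVE IN-CLAIM-AMOUNT TO WS-AMOUNT.
    MOVE WS-AMOUNT TO OUT-PAYABLE.
    MOVE ZERO TO WS-COUNTER.
    MOVE IN-POLICY-ID TO OUT-POLICY-ID.
"""

mappings = direct_mappings(SNIPPET)
lineage = upstream_fields(mappings, "OUT-PAYABLE")
print(mappings)
print(sorted(lineage))
```

A map like this lets an analyst check that an output field is fed by the inputs the business expects, without reading the procedure division line by line.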
Are there integration options with DevOps pipelines and enterprise tools used in COBOL programming and system maintenance?
Modern AI-assisted platforms can integrate with enterprise DevOps tooling, including Git-based repositories, mainframe CI/CD orchestrators, ticketing systems and code quality tools. This enables automated documentation refresh, traceability of extracted requirements and collaborative validation between COBOL engineers and modernization teams.
