Thursday, September 5, 2019
Linguistic Automatic Generation Natural Language
Linguistic Automatic Generation Natural Language 1. Introduction 1.1. The Problem Statement This thesis deals with the problem of Automatic generation of a UML Model from Natural Language Software Requirement Specifications. This thesis describes the development of Auto Modeler an Automated Software Engineering tool that takes Natural Language Software System Requirement Specifications as Input, performs an automated OO analysis and tries to produce an UML Model (a partial one in its present state i.e. static Class diagrams only) as output. The basis for Auto Modeler is described in [2][3]. 1.2. Motivation We conducted a short survey of the Software Industry in Islamabad in order to determine what sorts of Automated Software Engineering Tools were required by the Software houses. The result of the Survey (see Appendix-I for the survey report) indicated that there is demand for such a tool as Auto Modeler. Since such tools i.e. [2][3] that have already been developed are either not available in the market or are very expensive, and thus out of the reach of most software houses. Therefore we decided to build our own tool that can be used by the software industry in order to enable them to be more productive and competitive. But at present Auto Modeler is not ready for commercial use. But it is hoped that future versions of Auto Modeler will be able to cater to the needs of the Software Houses. 1.3. Background 1.3.1. The need for Automated Software Engineering Tools: In this era of Information Technology great demands are placed on Software Systems and on all those that are involved in the SDLC. The developed software should not only be of high quality but it should also be developed in minimal amount of time. When it comes to Software quality, the software must be highly reliable and it should meet the customers needs and it should satisfy the customers expectations. Automated Software Engineering Tools can assist the Software Engineers and Software Developers in producing High Quality Software in minimal amount of time. 1.3.2. Requirements Engineering: Requirements engineering consists of the following tasks [6]: à · Requirements Elicitation à · Requirements Analysis à · Requirements Specification à · Requirements Validation / Verification à · Requirements Management Requirements engineering is recognized as a critical task, since many software failures originate from inconsistent, incomplete or simply incorrect System Requirements specifications. 1.3.3. Natural Language Requirement Specifications: Formal methods have been successfully used to express Requirements Specifications, but often the customer cannot understand them and therefore cannot validate them [4]. Natural Language is the only common medium understood by both the Customer and the Analyst [4]. So the System Requirements Specifications are often written in Natural Language. 1.3.4. Object Oriented Analysis: The System Analyst must manually process The Natural Language Requirements Specifications Document and perform an OO Analysis and produce the results in the form of an UML Model, which has become a Standard in the Software Industry. The manual process is laborious, time consuming and often prone to errors. Some specified requirements might be left out. If there are problems or errors in the original requirements specifications, they may not be discovered in the manual process. OOA applies the OO paradigm to models of proposed systems by defining classes, objects and the relationships between them. Classes are the most important building block of an OO system and from these we instantiate objects. Once an individual object is created it inherits the same operations, relationships, semantics, and attributes identified in the class. Attributes of classes, and hence objects, hold values of properties. Operations, also called methods, describe what can be done to an object/class.[1] A relationship between classes/objects can show various attributes such as aggregation, composition, generalization and dependency. Attributes and operations represent the semantics of the class, while relationships represent the semantics of the model [1]. The KRB seven-step method, introduced by Kapur, Ravindra and Brown, proposes how to find classes and objects manually [1]. Hence, Identify candidate classes (nouns in NL). Define classes (look for instantiations of classes). Establishing associations (capturing verbs to create association for each pair of classes in 1 and 2). Expanding many-to-many associations. Identify class attributes. Normalize attributes so that they are associated with the class of objects that they truly describe. Identify class operations. From this process we can see that one goal of OOA is to identify NL concepts that can be transformed into OO concepts; which can then be used to form system models in particular notations. Here we shall concentrate on UML [1]. 1.3.5. Natural Language Processing (NLP): If an automatic analysis of the NL Requirements Document is carried out then it is not only possible to quickly find errors in the Specifications but with the right methods we can quickly generate a UML model from the Requirements. Although, Natural language is inherently ambiguous, imprecise and incomplete; often a natural language document is redundant, and several classes of terminological problems (e.g., jargon or specialist terms) can arise to make communication difficult [2] and it has been proven that Natural Language processing with holistic objectives is a very complex task, it is possible to extract sufficient meaning from NL sentences to produce reliable models. Complexities of language range from simple synonyms and antonyms to such complex issues as idioms, anaphoric relations or metaphors. Efforts in this particular area have had some success in generating static object models using some complex NL requirement sentences. 1.3.5.1. Linguistic analysis: Linguistic analysis studies NL text from different linguistic levels, i.e. words, sentence and meaning.[1] (i) Word-tagging analyses how a word is used in a sentence. In particular, words can be changeable from one sentence to another depending on context (e.g. light can be used as noun, verb, adjective and adverb; and while can be used as preposition, conjunction, verb and noun). Tagging techniques are used to specify word-form for each single word in a sentence, and each word is tagged as a Part Of Speech (POS), e.g. a NN1 tag would denote a singular noun, while VBB would signify the base form of a verb.[1] (ii) Syntactic analysis applies phrase marker, or labeled bracketing, techniques to segment NL as phrases, clauses and sentences, so that the NL is delineated by syntactical/grammatical annotations. Hence we can shows how words are grouped and connected to each other in a sentence.[1] (iii) Semantic analysis is the study of the meaning. It uses discourse annotation techniques to analyze open-class or content words and closed-class words (i.e. prepositions, conjunctions, pronouns). The POS tags and syntactic elements mentioned previously can be linked in the NL text to create relationships. Applying these linguistic analysis techniques, NLP tools can carry out morphological processing, syntactic processing and semantic processing. The processing of NL text can be supported by Semantic Network (SN) and corpora that provide a knowledge base for text analysis. The difficulty of OOA is not just due to the ambiguity and complexity of NL itself, but also the gap in meaning between the NL concepts and OO concepts.[1] 1.3.6. From NLP to UML Model Creation. After NLP the sentences are simplified in order to make identification of UML model elements form NL elements easy. Simple Heurists are used to Identify UML Model elements from Natural Text: (see Chapter 7) * Nouns indicate a class * Verb indicates an operation * Possessive relationships and Verbs like to have, identify, denote indicate attributes * Determiners are used to identify the multiplicity of roles in associations. 1.5. Plan of the thesis In Chapter 2 we present a brief survey of previous work and work similar to our work. Chapters 3, 4, 5, 6 and 7 describe the theoretical basis for Auto Modeler. Chapter 8 Describes the Architecture of Auto Modeler. In Chapter 9 we describe Auto Modeler in action with a case study. In Chapter 10 we present conclusions. 2. Literature Survey The first relevant published technique attempting to produce a systematic procedure to produce design models from NL requirements was Abbot. Abbott (1983) proposes a linguistic based method for analyzing software requirements, expressed in English, to derive basic data types and operations. [1] This approach was further developed by Booch (1986). Booch describes an Object-Oriented Design method where nouns in the problem description suggest objects and classes of objects, and verbs suggest operations.[1] Saeki et al. (1987) describe a process of incrementally constructing software modules from object-oriented specifications obtained from informal natural language requirements. Their system analyses the informal requirements one sentence at a time. Nouns and verbs are automatically extracted from the informal requirements but the system cannot determine which words are relevant for the construction of the formal specification. Hence an important role is played by the human analyst who reviews and refines the system results manually after each sentence is processed.[1] Dunn and Orlowska (1990) describe a natural language interpreter for the construction of NIAM (Nijssens, or Natural-language, Information Analysis Method ) conceptual schemas. The construction of conceptual schemas involves allocating surface objects to entity types (semantic classes) and the identification of elementary fact types. The system accepts declarative sentences only and uses grammar rules and a dictionary for type allocation and the identification of elementary fact types.[1] Meziane (1994) implemented a system for the identification of VDM data types and simple operations from natural language software requirements. The system first generates an Entity-Relationship Model (ERM) from the input text and then generates VDM data types from the ERM.[1] Mich and Garigliano (1994) and Mich (1996) describe an NL-based prototype system, NL-OOPS, that is aimed at the generation of object-oriented analysis models from natural language specifications. This system demonstrated how a large scale NLP system called LOLITA can be used to support the OO analysis stage.[1] V. Ambriola and V. Gervasi.[4] have developed CIRCE an environment for the analysis of natural language requirements. It is based on the concept of successive transformations that are applied to the requirements, in order to obtain concrete (i.e., rendered) views of models extracted from the requirements. CIRCE uses, CICO a domain-based, fuzzy matching, parser which parses the requirements document and converts it into an abstract parse tree. This parse tree is encoded as tuples and stored in a shared repository by CICO. A group of related tuples constitutes a T-Model. CIRCE uses internal tools to refine the encoded tuples called extensional knowledge and the knowledge about the basic behavior of software systems called intentional knowledge derived from modelers to further enrich the Tuple space. When a specific concrete view on the requirements is desired, a projector is called to build an abstract view of the data from the tuple space. A translator then converts the abstract view to a concrete view. In [5] V. Ambriola and V. Gervasi describe their experience of automatic synthesis of UML diagrams from Natural Language Requirement Specifications using their CIRCE environment. Delisle et al., in their project DIPETT-HAIKU, capture candidate objects, linguistically differentiating between Subjects (S) and Objects (O), and processes, Verbs (V), using the syntactic S-V-O sentence structure. This work also suggests that candidate attributes can be found in the noun modifier in compound nouns, e.g. reserved is the value of an attribute of ââ¬Å"reserved bookâ⬠.[1] Harmain and Gaizauskas developed a NLP based CASE tool, CM-Builder [2][3], which, automatically constructs an initial class model from NL text. It captures candidate classes, rather than candidate objects. Bà ¶rstler constructs an object model automatically based on pre-specified key words in a use case description. The verbs in the key words are transformed to behaviors and nouns are transformed to objects.[1] Overmyer and Rambow developed NLP system to construct UML class diagrams from NL descriptions. Both these efforts require user interaction to identify OO concepts.[1] The prototype tool developed by Perez-Gonzalez and Kalita supports automatic OO modeling from NL problem descriptions into UML notations, and produces both static and dynamic views. The underlying methodology includes theta roles and semi-natural language.[1] 3. Software Requirements Engineering Software requirements engineering is the science and discipline concerned with establishing and documenting software requirements [6]. It consists of: * Software requirements elicitation:- The process through which the customers (buyers and/or users) and the developer (contractor) of a software system discover, review, articulate, and understand the users needs and the constraints on the software and the development activity. * Software requirements analysis:- The process of analyzing the customers and users needs to arrive at a definition of software requirements. * Software requirements specification:- The development of a document that clearly and precisely records each of the requirements of the software system. * Software requirements verification:- The process of ensuring that the software requirements specification is in compliance with the system requirements, conforms to document standards of the requirements phase, and is an adequate basis for the architectural (preliminary) design phase. * Software requirements management:- The planning and controlling of the requirements elicitation, specification, analysis, and verification activities. In turn, system requirements engineering is the science and discipline concerned with analyzing and documenting system requirements. It involves transforming an operational need into a system description, system performance parameters, and a system configuration This is accomplished through the use of an iterative process of analysis, design, trade-off studies, and prototyping. Software requirements engineering has a similar definition as the science and discipline concerned with analyzing and documenting software requirements. It involves partitioning system requirements into major subsystems and tasks, then allocating those subsystems or tasks to software. It also transforms allocated system requirements into a description of software requirements and performance parameters through the use of an iterative process of analysis, design, trade-off studies, and prototyping. A system can be considered a collection of hardware, software, data, people, facilities, and procedures organized to accomplish some common objectives. In software engineering, a system is a set of software programs that provide the cohesiveness and control of data that enables the system to solve the problem.[6] The major difference between system requirements engineering and software requirements engineering is that the origin of system requirements lies in user needs while the origin of software requirements lies in the system requirements and/or specifications. Therefore, the system requirements engineer works with users and customers, eliciting their needs, schedules, and available resources, and must produce documents understandable by them as well as by management, software requirements engineers, and other system requirements engineers. The software requirements engineer works with the system requirements documents and engineers, translating system documentation into software requirements which must be understandable by management and software designers as well as by software and system requirements engineers. Accurate and timely communication must be ensured all along this chain if the software designers are to begin with a valid set of requirements. [6] 4. Automated Software Engineering Tools Software engineering is concerned with the analysis, design, implementation, testing, and maintenance of large software systems. Automated software engineering focuses on how to automate or partially automate these tasks to achieve significant improvements in quality and productivity. Automated software engineering applies computation to software engineering activities. The goal is to partially or fully automate these activities, thereby significantly increasing both quality and productivity. This includes the study of techniques for constructing, understanding, adapting and modeling both software artifacts and processes. Automatic and collaborative systems are both important areas of automated software engineering, as are computational models of human software engineering activities. Knowledge representations and artificial intelligence techniques applicable in this field are of particular interest, as are formal techniques that support or provide theoretical foundations.[7] Automated software engineering approaches have been applied in many areas of software engineering. These include requirements definition, specification, architecture, design and synthesis, implementation, modeling, testing and quality assurance, verification and validation, maintenance and evolution, configuration management, deployment, reengineering, reuse and visualization. Automated software engineering techniques have also been used in a wide range of domains and application areas including industrial software, embedded and real-time systems, aerospace, automotive and medical systems, Web-based systems and computer games.[7] Research into Automated Software Engineering includes the following areas: * Automated reasoning techniques * Component-based systems * Computer-supported cooperative work * Configuration management * Domain modeling and meta-modeling * Human-computer interaction * Knowledge acquisition and management * Maintenance and evolution * Model-based software development * Modeling language semantics * Ontologies and methodologies * Open systems development * Product line architectures * Program understanding * Program synthesis * Program transformation * Re-engineering * Requirements engineering * Specification languages * Software architecture and design * Software visualization * Testing, verification, and validation * Tutoring, help, and documentation systems 5. Natural Language Processing Natural language processing (NLP) is a subfield of artificial intelligence and linguistics. It studies the problems of automated generation and understanding of natural human languages. Natural language generation systems convert information from computer databases into normal-sounding human language, and natural language understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate. 5.1. Language Processing Language processing can be divided into two tasks:[11] * Processing written text, using lexical, syntactic, and semantic knowledge of the language as well as any required real world information.[11] * Processing spoken language, using all the information needed above, plus additional knowledge about phonology as well as enough additional information to handle the further ambiguities that arise in speech.[11] 5.2. Uses for NLP: 5.2.1. User interfaces. Better than obscure command languages. It would be nice if you could just tell the computer what you want it to do. Of course we are talking about a textual interface not speech.[10] 5.2.2. Knowledge-Acquisition. Programs that could read books and manuals or the newspaper. So you dont have to explicitly encode all of the knowledge they need to solve problems or do whatever they do.[10] 5.2.3. Information Retrieval. Find articles about a given topic. Program has to be able somehow to determine whether the articles match a given query.[10] 5.2.4. Translation. It sure would be nice if machines could automatically translate from one language to another. This was one of the first tasks they tried applying computers to. It is very hard.[10] 5.3. Linguistic levels of Analysis Language obeys regularities and exhibits useful properties at a number of somewhat separable levels.[10] Think of language as transfer of information. It is much more than that. But that is a good place to start. Suppose that the speaker has some meaning that they wish to convey to some hearer.[10] Speech (or gesture) imposes a linearity on the signal. All you can play with is the properties of a sequence of tokens. Actually, why tokens? Well for one thing that makes it possible to learn.[10] So the other thing to play with is the order the tokens can occur. So somehow, a meaning gets encoded as a sequence of tokens, each of which has some set of distinguishable properties, and is then interpreted by figuring out what meaning corresponds to those tokens in that order.[10] Another way to think about it is that the properties of the tokens and their sequence somehow elicits an understanding of the meaning. Language is a set of resources to enable us to share meanings, but isnt best thought of as a means for *encoding* meanings. This is a sort of philosophical issue perhaps, but if this point of view is true, it makes much of the AI approach to NLP somewhat suspect, as it is really based on the encoded meanings view of language.[10] The lowest level is the actual properties of the signal stream: phonology speech sounds and how we make them morphology the structure of words syntax how the sequences are structured semantics meanings of the strings There are important interfaces among all of these levels. For example sometimes the meaning of sentences can determine how individual words are pronounced.[10] This many levels is obviously needed. But language turns out to be more clever than this. For example, language can be more efficient by not having to say the same thing twice, so we have pronouns and other ways of making use of what has already been said: A bear went into the woods. It found a tree. Also, since language is most often used among people who are in the same situation, it can make use of features of the situation: this/that you/me/they here/there now/then The mechanisms whereby features of the context, whether it is the context created by a sequence of sentences, or the actual context where the speaking happens is called pragmatics.[10] Another issue has to do with the fact that the simple model of language as information transfer is clealy not right. For one thing, we know there are at least the following three types of sentences: statements imperatives questions And each of them can be used to do a different kind of thing. The first *might* be called information transfer. But what about imperatives? What about questions? To some degree the analysis of such sentences can involve the ideas of a basic notion of meaning Speech acts.[10] There are other, higher-levels of structuring that language exhibits. For example there is conversational structure, where people know when they get to talk in a conversation, and what constitutes a valid contribution. There is narrative structure whereby stories are put together in ways that make sense and are interesting. There is expository structure which involves the way that informative texts (like encyclopedias) are arranged so as to usefully convey information. These issues blend off from linguistics into literature and library science, among other things.[10] Of course with hypertext and multi-media and virtual reality, these higher levels of structure are being explored in new ways.[10] 5.4. Steps in Natural Language Understanding The steps in the process of natural language understanding are:[11] 5.4.1. Morphological analysis Individual words are analyzed into their components, and non-word tokens (such as punctuation) are separated from the words. For example, in the phrase Bills house the proper noun Bill is separated from the possessive suffix s.[11] 5.4.2. Syntactic analysis. Linear sequences of words are transformed into structures that show how the words relate to one another. This parsing step converts the flat list of words of the sentence into a structure that defines the units represented by that list. Constraints imposed include word order (manager the key is an illegal constituent in the sentence I gave the manager the key); number agreement; case agreement.[11] 5.4.3. Semantic analysis. The structures created by the syntactic analyzer are assigned meanings. In most universes, the sentence Colorless green ideas sleep furiously [Chomsky, 1957] would be rejected as semantically anomalous. This step must map individual words into appropriate objects in the knowledge base, and must create the correct structures to correspond to the way the meanings of the individual words combine with each other. [11] 5.4.4. Discourse integration. The meaning of an individual sentence may depend on the sentences that precede it and may influence the sentences yet to come. The entities involved in the sentence must either have been introduced explicitly or they must be related to entities that were. The overall discourse must be coherent. [11] 5.4.5. Pragmatic analysis. The structure representing what was said is reinterpreted to determine what was actually meant. [11] 5.5. Syntactic Processing Syntactic parsing determines the structure of the sentence being analyzed. Syntactic analysis involves parsing the sentence to extract whatever information the word order contains. Syntactic parsing is computationally less expensive than semantic processing.[10] A grammar is a declarative representation that defines the syntactic facts of a language. The most common way to represent grammars is as a set of production rules, and the simplest structure for them to build is a parse tree which records the rules and how they are matched. [10] Sometimes backtracking is required (e.g., The horse raced past the barn fell), and sometimes multiple interpretations may exist for the beginning of a sentence (e.g., Have the students who missed the exam ). [10] Example: Syntactic processing interprets the difference between John hit Mary and Mary hit John. 5.6. Semantic Analysis After (or sometimes in conjunction with) syntactic processing, we must still produce a representation of the meaning of a sentence, based upon the meanings of the words in it. The following steps are usually taken to do this: [10] 5.6.1. Lexical processing. Look up the individual words in a dictionary. It may not be possible to choose a single correct meaning, since there may be more than one. The process of determining the correct meaning of individual words is called word sense disambiguation or lexical disambiguation. For example, Ill meet you at the diamond can be understood since at requires either a time or a location. This usually leads to preference semantics when it is not clear which definition we should prefer. [10] 5.6.2. Sentence-level processing. There are several approaches to sentence-level processing. These include semantic grammars, case grammars, and conceptual dependencies. [10] Example: Semantic processing determines the differences between such sentences as The ink is in the pen and The ink is in the pen. 5.6.3. Discourse and Pragmatic Processing. To understand most sentences, it is necessary to know the discourse and pragmatic context in which it was uttered. In general, for a program to participate intelligently in a dialog, it must be able to represent its own beliefs about the world, as well as the beliefs of others (and their beliefs about its beliefs, and so on).[10] The context of goals and plans can be used to aid understanding. Plan recognition has served as the basis for many understanding programs PAM is an early example. [10] 5.7. Issues in Syntax For various reasons, a lot of attention in computational linguistics has been paid to syntax. Partly this has to do with the fact that real linguistics have spent a lot of work on it. Partly because it needs to be done before just about anything else can be done. I wont talk much about morphology. We will assume that words can be associated with a set of features or properties. For example the word dog is a noun, it is singular, its meaning involves a kind of animal. The word dogs is related, obviously, but has the property of being plural. The word eat is a verb, it is in what we might call the base form, it denotes a particular kind of action. The word ate is related, it is in the past tense form. You can imagine Im sure that the techniques of knowledge representation that we have looked at can be applied to the problem of representing facts about the properties and relations among words. [11] The key observation in the theory of syntax is that the words in a sentence can be more or less naturally grouped into what are called phrases, and those phrases can often be treated as a unit. So in a sentence The dog chased the bear, the sequence the dog forms a natural unit. The sequence chased the bear is a natural unit, as is the bear.[11] Why do I say that the dog is a natural unit? Well one thing is that I can replace it by another sequence that has the same referent, or a related referent. For example I could replace it by: [11] Snoopy (a name) It (a pronoun) My brothers favorite pet (a more complex description) What about chased the bear? Again, I could replace it by died (a single word) was hit by a truck (a more complex event) This basic structure, in English, is sometimes called the subject-predicate structure. The subject is a nominal, something that can refer to an object or thing, the predicate is a verb phrase, which describes an action or event. Of course, as in the example, the verb phrase can also contain other constituents, for example another nominal. [11] These phrases also have structure. For example a noun phrase (a kind of nominal) can have a determiner, zero or more adjectives, and a noun, maybe followed by another phrase, like: the big dog that ate my homework Verb phrases can have complicated verb groups like will not be eaten Syntactic theories try to predict and explain what patterns are used in a language. Sometimes this involves figuring out what patterns just dont work. For example the following sentences have something wrong with them: [11] * the dogs runs home * he died the book * she saw himself in the mirror * they told it to she Figuring out exactly what is wrong with such sentences allows linguists to create theories that help understand the way that sentences
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.