PROGRAM TRANSLATORS
A program translator is a computer program that translates a program written in a given programming language into a functionally equivalent program in a different computer language, without losing the functional or logical structure of the original code (the "essence" of each program).
These include translations between high-level and human-readable computer languages such as C++, Java and COBOL, intermediate-level languages such as Java bytecode, low-level languages such as assembly language and machine code, and between similar levels of language on different computing platforms, as well as from any of these to any other of these.
They also include translators between software implementations and hardware/ASIC microchip implementations of the same program, and from software descriptions of a microchip to the logic gates needed to build it.
1. COMPILERS
A compiler is a computer program (or set of programs) that transforms source code written in a programming language (the source language) into another computer language (the target language, often having a binary form known as object code).
The most common reason for converting a source code is to create an executable program.
The name "compiler" is primarily used for programs that translate source code from a high-level programming language to a lower level language (e.g., assembly language or machine code).
If the compiled program can run on a computer whose CPU or operating system is different from the one on which the compiler runs, the compiler is known as a cross-compiler. More generally, compilers are a specific type of translator.
A program that translates from a low level language to a higher level one is a decompiler.
A program that translates between high-level languages is usually called a source-to-source compiler or transpiler.
A language rewriter is usually a program that translates the form of expressions without a change of language.
The term compiler-compiler is sometimes used to refer to a parser generator, a tool often used to help create the lexer and parser.
A compiler is likely to perform many or all of the following operations:
- Lexical analysis
- Preprocessing
- Parsing
- Semantic analysis (syntax-directed translation)
- Code generation and code optimization
Compilers enabled the development of programs that are machine-independent.
Before the development of FORTRAN, the first higher-level language, in the 1950s, machine-dependent assembly language was widely used.
While assembly language provides more abstraction than machine code on the same architecture, just as with machine code, it has to be modified or rewritten if the program is to be executed on a different computer hardware architecture.
With the advent of high-level programming languages that followed FORTRAN, such as COBOL, C, and BASIC, programmers could write machine-independent source programs. A compiler translates the high-level source programs into target programs in machine languages for the specific hardware. Once the target program is generated, the user can execute the program.
STRUCTURE OF A COMPILER
Compilers bridge source programs in high-level languages with the underlying hardware.
A compiler verifies code syntax, generates efficient object code, performs run-time organization, and formats the output according to assembler and linker conventions.
A compiler consists of:
1. THE FRONT END
- It verifies syntax and semantics, and generates an intermediate representation or IR of the source code for processing by the middle-end.
- Performs type checking by collecting type information.
- Generates errors and warnings, if any, in a useful way.
- Aspects of the front end include lexical analysis, syntax analysis, and semantic analysis.
- The compiler frontend analyzes the source code to build an internal representation of the program, called the intermediate representation or IR.
- It also manages the symbol table, a data structure mapping each symbol in the source code to associated information such as location, type and scope.
- While the frontend can be a single monolithic function or program, as in a scannerless parser, it is more commonly implemented and analyzed as several phases, which may execute sequentially or concurrently.
- In some cases additional phases are used, notably line reconstruction and preprocessing, but these are rare.
- A detailed list of possible phases includes:
1. Line reconstruction
- Languages which strop their keywords or allow arbitrary spaces within identifiers require a phase before parsing, which converts the input character sequence to a canonical form ready for the parser.
- The top-down, recursive-descent, table-driven parsers used in the 1960s typically read the source one character at a time and did not require a separate tokenizing phase.
- Atlas Autocode and Imp (and some implementations of ALGOL and Coral 66) are examples of stropped languages whose compilers would have a line reconstruction phase.
2. Lexical analysis
- It breaks the source code text into small pieces called tokens. Each token is a single atomic unit of the language, for instance a keyword, identifier or symbol name.
- The token syntax is typically a regular language, so a finite state automaton constructed from a regular expression can be used to recognize it.
- This phase is also called lexing or scanning, and the software doing lexical analysis is called a lexical analyzer or scanner.
- This may not be a separate step – it can be combined with the parsing step in scannerless parsing, in which case parsing is done at the character level, not the token level.
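To make this concrete, here is a minimal lexer sketch in Python. The token names and the mini-language they describe are invented for the example, and a regular expression per token kind stands in for the finite state automaton:

```python
import re

# Token kinds for a hypothetical mini-language; each kind is a
# regular expression, so the whole token syntax is a regular language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),           # integer literal
    ("IDENT",  r"[A-Za-z_]\w*"),  # identifier or keyword
    ("OP",     r"[+\-*/=()]"),    # single-character operator
    ("SKIP",   r"\s+"),           # whitespace, discarded below
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Break source text into (kind, text) tokens, the atomic units
    handed to the parser. A real scanner would also report characters
    that match no token rule."""
    for match in MASTER.finditer(source):
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()

print(list(tokenize("x = 2 + 40")))
# [('IDENT', 'x'), ('OP', '='), ('NUMBER', '2'), ('OP', '+'), ('NUMBER', '40')]
```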
3. Preprocessing
- Some languages, e.g., C, require a preprocessing phase which supports macro substitution and conditional compilation. Typically the preprocessing phase occurs before syntactic or semantic analysis; e.g., in the case of C, the preprocessor manipulates lexical tokens rather than syntactic forms. However, some languages such as Scheme support macro substitution based on syntactic forms.
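A minimal sketch in Python of macro substitution in the spirit of the C preprocessor's #define. The handling here (object-like macros only, text lines rather than lexical tokens, whole-word substitution) is a deliberate simplification:

```python
import re

def preprocess(source):
    """Collect '#define NAME VALUE' lines, then substitute NAME
    wherever it appears as a whole word in later lines."""
    macros, output = {}, []
    for line in source.splitlines():
        m = re.match(r"\s*#define\s+(\w+)\s+(.*)", line)
        if m:
            macros[m.group(1)] = m.group(2)   # record the macro, emit nothing
            continue
        for name, value in macros.items():
            line = re.sub(rf"\b{name}\b", value, line)
        output.append(line)
    return "\n".join(output)

print(preprocess("#define SIZE 1024\nchar buf[SIZE];"))
# char buf[1024];
```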
4. Syntax analysis
- It involves parsing the token sequence to identify the syntactic structure of the program.
- This phase typically builds a parse tree, which replaces the linear sequence of tokens with a tree structure built according to the rules of a formal grammar which define the language's syntax.
- The parse tree is often analyzed, augmented, and transformed by later phases in the compiler.
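A minimal recursive-descent sketch of this phase, reusing the (kind, text) tokens from the lexer sketch above. The toy grammar (sums and differences of numbers) and the tuple-based tree shape are assumptions of the example:

```python
def parse_expr(tokens, pos=0):
    """expr -> NUMBER (('+' | '-') NUMBER)* ; returns (tree, next_pos).
    The tree is a nested tuple such as ('+', ('num', 2), ('num', 40))."""
    kind, text = tokens[pos]
    assert kind == "NUMBER", f"expected a number at position {pos}"
    tree, pos = ("num", int(text)), pos + 1
    while pos < len(tokens) and tokens[pos][1] in "+-":
        op = tokens[pos][1]
        kind, text = tokens[pos + 1]
        assert kind == "NUMBER", "operator must be followed by a number"
        tree, pos = (op, tree, ("num", int(text))), pos + 2
    return tree, pos

tokens = [("NUMBER", "2"), ("OP", "+"), ("NUMBER", "40")]
print(parse_expr(tokens)[0])   # ('+', ('num', 2), ('num', 40))
```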
5. Semantic analysis
- It is the phase in which the compiler adds semantic information to the parse tree and builds the symbol table. This phase performs semantic checks such as type checking (checking for type errors), object binding (associating variable and function references with their definitions), and definite assignment (requiring all local variables to be initialized before use), rejecting incorrect programs or issuing warnings.
- Semantic analysis usually requires a complete parse tree, meaning that this phase logically follows the parsing phase and logically precedes the code generation phase, though it is often possible to fold multiple phases into one pass over the code in a compiler implementation.
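A minimal sketch of a semantic-analysis pass over trees of the shape used above: it consults a symbol table and performs simple type checking. The node kinds and type names are invented for the example:

```python
def check(tree, symbols):
    """Return the type of an expression tree, consulting the symbol
    table for variables and rejecting ill-typed programs."""
    node = tree[0]
    if node == "num":
        return "int"
    if node == "var":                      # ('var', name)
        name = tree[1]
        if name not in symbols:
            raise NameError(f"undeclared variable {name!r}")
        return symbols[name]               # type recorded at declaration
    if node in ("+", "-"):                 # (op, left, right)
        left, right = check(tree[1], symbols), check(tree[2], symbols)
        if left != right:
            raise TypeError(f"cannot apply {node!r} to {left} and {right}")
        return left
    raise ValueError(f"unknown node {node!r}")

symbols = {"x": "int"}                     # symbol table: name -> type
print(check(("+", ("var", "x"), ("num", 1)), symbols))   # int
```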
2. THE MIDDLE END
- Performs optimizations, including removal of useless or unreachable code, discovery and propagation of constant values, relocation of computation to a less frequently executed place (e.g., out of a loop), and specialization of computation based on the context. It then generates another IR for the back end.
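As a small illustration of one such optimization, here is a constant-folding sketch (a simple case of discovering and propagating constant values) over the same toy trees used above:

```python
def fold(tree):
    """Replace operator nodes whose operands are both constants with
    the computed constant, leaving everything else untouched."""
    if tree[0] in ("+", "-"):
        left, right = fold(tree[1]), fold(tree[2])
        if left[0] == "num" and right[0] == "num":
            value = left[1] + right[1] if tree[0] == "+" else left[1] - right[1]
            return ("num", value)          # computed at compile time
        return (tree[0], left, right)
    return tree

print(fold(("+", ("num", 2), ("num", 40))))    # ('num', 42)
print(fold(("+", ("var", "x"), ("num", 0))))   # unchanged: not all constant
```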
3. THE BACK END
Generates the assembly code, performing register allocation in the process (assigning processor registers to program variables where possible).
Optimizes the target code's use of the hardware, e.g., by figuring out how to keep parallel execution units busy and by filling delay slots.
Although many optimization problems are NP-hard, heuristic techniques for them are well-developed.
- The main phases of the back end include the following:
- Analysis: This is the gathering of program information from the intermediate representation derived from the input; data-flow analysis is used to build use-define chains, together with dependence analysis, alias analysis, pointer analysis, escape analysis, etc. Accurate analysis is the basis for any compiler optimization. The call graph and control flow graph are usually also built during the analysis phase.
- Optimization: the intermediate language representation is transformed into functionally equivalent but faster (or smaller) forms. Popular optimizations are inline expansion, dead code elimination, constant propagation, loop transformation, register allocation and even automatic parallelization.
- Code generation: the transformed intermediate language is translated into the output language, usually the native machine language of the system. This involves resource and storage decisions, such as deciding which variables to fit into registers and memory and the selection and scheduling of appropriate machine instructions along with their associated addressing modes (see also Sethi-Ullman algorithm). Debug data may also need to be generated to facilitate debugging.
Compiler analysis is the prerequisite for any compiler optimization, and the two work tightly together. For example, dependence analysis is crucial for loop transformation.
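To illustrate the code generation step listed above, here is a minimal sketch that translates the same toy trees into instructions for a made-up stack machine. Real back ends select actual machine instructions and perform register allocation, which this deliberately omits:

```python
def codegen(tree, out):
    """Emit postorder stack-machine instructions for an expression
    tree: operands are pushed, then the operator consumes them."""
    if tree[0] == "num":
        out.append(f"PUSH {tree[1]}")
    elif tree[0] in ("+", "-"):
        codegen(tree[1], out)
        codegen(tree[2], out)
        out.append("ADD" if tree[0] == "+" else "SUB")
    else:
        raise ValueError(f"unknown node {tree[0]!r}")
    return out

print(codegen(("+", ("num", 2), ("num", 40)), []))
# ['PUSH 2', 'PUSH 40', 'ADD']
```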
TYPES OF COMPILERS
1. SINGLE PASS COMPILER
- The ability to compile in a single pass has classically been seen as a benefit because it simplifies the job of writing a compiler and one-pass compilers generally perform compilations faster than multi-pass compilers.
- Thus, partly driven by the resource limitations of early systems, many early languages were specifically designed so that they could be compiled in a single pass (e.g., Pascal).
- In some cases the design of a language feature may require a compiler to perform more than one pass over the source. For instance, consider a declaration appearing on line 20 of the source which affects the translation of a statement appearing on line 10. In this case, the first pass needs to gather information about declarations appearing after statements that they affect, with the actual translation happening during a subsequent pass.
- The disadvantage of compiling in a single pass is that it is not possible to perform many of the sophisticated optimizations needed to generate high quality code. It can be difficult to count exactly how many passes an optimizing compiler makes. For instance, different phases of optimization may analyse one expression many times but only analyse another expression once.
- Splitting a compiler up into small programs is a technique used by researchers interested in producing provably correct compilers. Proving the correctness of a set of small programs often requires less effort than proving the correctness of a larger, single, equivalent program.
2. MULTI PASS COMPILER
- While the typical multi-pass compiler outputs machine code from its final pass, there are several other types:
- A "source-to-source compiler" is a type of compiler that takes a high level language as its input and outputs a high level language. For example, an automatic parallelizing compiler will frequently take in a high level language program as an input and then transform the code and annotate it with parallel code annotations (e.g. OpenMP) or language constructs (e.g. Fortran's
DOALL
statements).
3. INCREMENTAL COMPILER
- Individual functions can be compiled in a run-time environment that also includes interpreted functions. Incremental compilation dates back to 1962 and the first Lisp compiler, and is still used in Common Lisp systems.
4. STAGE COMPILER
- A compiler that compiles to the assembly language of a theoretical machine, as some Prolog implementations do.
- This Prolog machine is also known as the Warren Abstract Machine (WAM). Bytecode compilers for Java, Python, and many other languages are also a subtype of this.
5. JUST IN TIME COMPILER
- In computing, just-in-time (JIT) compilation, also known as dynamic translation, is compilation done during execution of a program – at run time – rather than prior to execution.
- Most often this consists of translation to machine code, which is then executed directly, but can also refer to translation to another format.
- JIT compilation is a combination of the two traditional approaches to translation to machine code – ahead-of-time compilation (AOT), and interpretation – and combines some advantages and drawbacks of both.
- JIT compilation combines the speed of compiled code with the flexibility of interpretation, with the overhead of an interpreter and the additional overhead of compiling (not just interpreting).
- JIT compilation is a form of dynamic compilation, and allows adaptive optimization such as dynamic recompilation – thus in theory JIT compilation can yield faster execution than static compilation. Interpretation and JIT compilation are particularly suited for dynamic programming languages, as the runtime system can handle late-bound data types and enforce security guarantees.
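A minimal sketch of the idea using Python's built-in compile() and eval(): an expression string is re-parsed on every call (interpretation) until it crosses a "hot" threshold, after which a cached compiled code object is reused. The threshold and caching policy are invented for the example:

```python
import collections

counts = collections.Counter()
compiled = {}          # cache of already-compiled code objects
HOT = 3                # hypothetical threshold: compile after 3 runs

def run(expr, env):
    """Interpret `expr` until it is hot, then switch to a cached
    compiled code object - the essence of JIT compilation."""
    if expr in compiled:
        return eval(compiled[expr], env)   # fast path: compiled code
    counts[expr] += 1
    if counts[expr] >= HOT:
        compiled[expr] = compile(expr, "<jit>", "eval")
    return eval(expr, env)                 # slow path: re-parse each time

env = {"x": 40}
for _ in range(5):
    print(run("x + 2", env))               # 42; compiled after the 3rd call
```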
ADVANTAGES OF COMPILER
- Source code is not included, so compiled code is more secure than interpreted code.
- Tends to produce faster code than interpreting source code.
- Produces an executable file, so the program can be run without the source code.
DISADVANTAGES OF COMPILER
- Object code must be produced before the final executable file; this can be a slow process.
- The source code must be free of compile-time errors for the executable file to be produced.
2. INTERPRETERS
In computer science, an interpreter is a computer program that directly executes, i.e. performs, instructions written in a programming or scripting language, without previously compiling them into a machine language program.
An interpreter is a program that reads in as input a source program, along with data for the program, and translates the source program instruction by instruction.
EXAMPLE
- The Java interpreter java translates a .class file into code that can be executed natively on the underlying machine.
- The program VirtualPC interprets programs written for the Intel Pentium architecture (IBM-PC clone) on the PowerPC architecture (Macintosh). This enables Macintosh users to run Windows programs on their computers.
An interpreter generally uses one of the following strategies for program execution:
- parse the source code and perform its behavior directly.
- translate source code into some efficient intermediate representation and immediately execute this.
- explicitly execute stored precompiled code made by a compiler which is part of the interpreter system.
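A minimal sketch of the first strategy, parsing source and performing its behavior directly, for a toy command language invented for the example:

```python
def interpret(program):
    """Directly execute a toy command language, one instruction at a
    time: 'set NAME VALUE' and 'print NAME' (invented for the example)."""
    variables = {}
    for line in program.splitlines():
        words = line.split()
        if not words:
            continue
        if words[0] == "set":              # e.g. "set x 42"
            variables[words[1]] = int(words[2])
        elif words[0] == "print":          # e.g. "print x"
            print(variables[words[1]])
        else:
            raise SyntaxError(f"unknown command: {line!r}")

interpret("set x 42\nprint x")             # prints 42
```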
APPLICATIONS
- Interpreters are frequently used to execute command languages and glue languages, since each operator executed in a command language is usually an invocation of a complex routine such as an editor or compiler.
- Self-modifying code can easily be implemented in an interpreted language. This relates to the origins of interpretation in Lisp and artificial intelligence research.
- Virtualization. Machine code intended for one hardware architecture can be run on another using a virtual machine, which is essentially an interpreter.
- Sandboxing: An interpreter or virtual machine is not compelled to actually execute all the instructions of the source code it is processing. In particular, it can refuse to execute code that violates any security constraints it is operating under.
ADVANTAGES OF INTERPRETER
- Easier to debug (check errors) than a compiler.
- Easier to create multi-platform code, as each different platform would have an interpreter to run the same code.
- Useful for prototyping software and testing basic program logic.
DISADVANTAGES OF INTERPRETER
- Source code is required for the program to be executed, and this source code can be read, making it insecure.
- Interpreters are generally slower than compiled programs due to the per-line translation method.
3. ASSEMBLERS
An assembler translates assembly language into machine code.
An assembler is a program that creates object code by translating combinations of mnemonics and syntax for operations and addressing modes into their numerical equivalents.
Assembly language
- It consists of mnemonics for machine opcodes, so assemblers perform a 1:1 translation from mnemonics to machine instructions.
- An assembly language (or assembler language) is a low-level programming language for a computer, or other programmable device, in which there is a very strong (generally one-to-one) correspondence between the language and the architecture's machine code instructions.
- Each assembly language is specific to a particular computer architecture, in contrast to most high-level programming languages, which are generally portable across multiple architectures, but require interpreting or compiling.
- Assembly language is converted into executable machine code by a utility program referred to as an assembler; the conversion process is referred to as assembly, or assembling the code.
For example:
LDA #4
converts to 0001001000100100
Conversely, one instruction in a high level language will translate to one or more instructions at machine level.
TYPES OF ASSEMBLERS
There are two types of assemblers based on how many passes through the source are needed to produce the executable program.
- One-pass assemblers go through the source code once. Any symbol used before it is defined will require "errata" at the end of the object code (or, at least, no earlier than the point where the symbol is defined) telling the linker or the loader to "go back" and overwrite a placeholder which had been left where the as yet undefined symbol was used.
- Multi-pass assemblers create a table with all symbols and their values in the first passes, then use the table in later passes to generate code.
In both cases, the assembler must be able to determine the size of each instruction on the initial passes in order to calculate the addresses of subsequent symbols.
This means that if the size of an operation referring to an operand defined later depends on the type or distance of the operand, the assembler will make a pessimistic estimate when first encountering the operation, and if necessary pad it with one or more "no-operation" instructions in a later pass or the errata. In an assembler with peephole optimization, addresses may be recalculated between passes to allow replacing pessimistic code with code tailored to the exact distance from the target.
The original reason for the use of one-pass assemblers was speed of assembly – often a second pass would require rewinding and rereading a tape or rereading a deck of cards.
With modern computers this has ceased to be an issue. The advantage of the multi-pass assembler is that the absence of errata makes the linking process (or the program load if the assembler directly produces executable code) faster.
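To make the two-pass scheme concrete, here is a minimal sketch for a made-up assembly language: pass one records label addresses in the symbol table, pass two emits code from it. The mnemonics, opcode values, and fixed two-byte instruction size are all invented for the example:

```python
# Hypothetical one-byte opcodes for a made-up machine whose
# instructions are all exactly two bytes long.
OPCODES = {"LDA": 0x01, "JMP": 0x02, "HLT": 0x03}

def assemble(lines):
    """Two-pass assembly: pass 1 builds the symbol table,
    pass 2 generates (opcode, operand) pairs from it."""
    symbols, address = {}, 0
    for line in lines:                     # pass 1: label addresses
        if line.endswith(":"):
            symbols[line[:-1]] = address   # label -> current address
        else:
            address += 2                   # every instruction is 2 bytes
    code = []
    for line in lines:                     # pass 2: emit code
        if line.endswith(":"):
            continue
        mnemonic, *rest = line.split()
        operand = rest[0] if rest else "0"
        value = int(operand) if operand.isdigit() else symbols[operand]
        code.append((OPCODES[mnemonic], value))
    return code

print(assemble(["start:", "LDA 4", "JMP start", "HLT"]))
# [(1, 4), (2, 0), (3, 0)]
```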
APPLICATIONS OF ASSEMBLERS
- Assembly language is typically used in a system's boot code, the low-level code that initializes and tests the system hardware prior to booting the operating system, and is often stored in ROM (the BIOS on IBM-compatible PC systems and CP/M are examples).
- Some compilers translate high-level languages into assembly first before fully compiling, allowing the assembly code to be viewed for debugging and optimization purposes.
- Relatively low-level languages, such as C, allow the programmer to embed assembly language directly in the source code. Programs using such facilities, such as the Linux kernel, can then construct abstractions using different assembly language on each hardware platform. The system's portable code can then use these processor-specific components through a uniform interface.
- Assembly language is useful in reverse engineering. Many programs are distributed only in machine code form which is straightforward to translate into assembly language, but more difficult to translate into a higher-level language. Tools such as the Interactive Disassembler make extensive use of disassembly for such a purpose.
- Assemblers can be used to generate blocks of data, with no high-level language overhead, from formatted and commented source code, to be used by other code.
ADVANTAGES OF ASSEMBLER:
- Very fast at translating assembly language to machine code, as there is (broadly) a 1-to-1 relationship.
- Assembly code is often very efficient (and therefore fast) because it is a low level language.
- Assembly code is fairly easy to understand, relative to machine code, due to the use of English-like mnemonics.
DISADVANTAGES OF ASSEMBLERS:
- Assembly language is written for a certain instruction set and/or processor.
- Assembly tends to be optimised for the hardware it's designed for, meaning it is often incompatible with different hardware.
- Lots of assembly code is needed to do relatively simple tasks, and complex programs require lots of programming time.
DIFFERENCES BETWEEN COMPILERS, INTERPRETERS AND ASSEMBLERS
BASIS | COMPILER | INTERPRETER | ASSEMBLER |
---|---|---|---|
1. DEFINITION | A compiler is a computer program that translates an entire program written in a high-level language (called source code) into an executable form (called object code). | An interpreter is a computer program that takes source code and converts each line in succession. | An assembler converts assembly language, rather than a high-level language, into machine code. |
2. INPUT | Takes the entire program as input. | Takes a single instruction as input. | Takes the source program in assembly language through an input device. |
3. MEMORY REQUIREMENT | More (since object code is generated). | Less. | |
4. ERRORS | Errors are displayed after the entire program is checked. | Errors are displayed for every instruction interpreted (if any). | Error messages generated during assembly may originate from the assembler, from a higher-level language such as C (many assemblers are written in C), or from the operating system environment. |