Thursday, 24 August 2017

How a Compiler Translates a Programming Language Statement?

Lets assume a statement m=a+b*c-4 written in some programming language (e.g. C ). So for running this statement we have to translate the statement into binary code or machine language which will then gets executed. The gcc or Turbo C compiler translate the above statement with the help of following six phases.

  1. Lexical Analysis: Every high level programming language is made up of some constructs like operators, keywords, expression, methods, syntax and the rules to write any statement. To translate the statement first the machine must understands the meaning of all the components of the statement m=a+b*c-4. These components are called as tokens means the meaningful constructs of any programming language. m, a, b, c are identifiers; =, +, - are operators. So these constructs are called as tokens. Identifiers, keywords, functions, delimiters, numbers etc are the tokens. Lexical analysis convert the statement into statement of tokens with lexeme. Lexeme means the pattern for a particular type of token. e.g. = is a lexeme of type operator token and m is a lexeme of type identifier token with a pattern represented by Regular Expression. The regular expression for digit (here 4) is [0-9]+.  So the lexical expression is like id=id+id*id-digit. The identifiers are stored in symbol table to record the attributes of the identifier like its data type, value etc which will be used in further phases.
  2. Syntax Analysis: Syntax of a statement means the rules to evaluate the expression or the statement. Syntax rules includes operator priority, position of identifiers and symbols, evaluation order etc. So syntax analyzer checks the syntax of the statement by generating a parse tree. Parsing can be top down or bottom up. The rules for generating a parse tree is represented by context free grammar. So the start symbol is generated using the rules with help of a parsing table to generate the parse tree. Various algorithms are designed to generate the parse tree like LALR parser.  
  3. Semantic Analysis: Symbol table is used to check the meaning of the statement by using the identifiers attributes. SDD (Syntax directed definition) is used for the Semantic analysis. SDD means CFG + Sementic Rules. It will help to evaluate the expression after parsing.
  4. Intermediate Code Generation: It is optional phase. To ease the code generation phase this phase is required. This phase minimizes the overhead on code generation by representing the syntax tree using some representation like three address code. It will also increase the portability of the compiler. Since the intermediate representation is machine independent, we can easily port the code and generate the code for some other machine architecture. The intermediate codes are Three address code (Quadraple, Trirples), Directed Acyclic Graph, Abstract Syntax Tree etc.
  5. Code Optimisation: We can optimize the code before code generation and after code generation. It is less overhead for compiler to optimise the code before code generation. Various techniques are used to optimize the code e.g. dead code elemination, loop hole optimization etc.
  6. Code Generation: Every machine architecture has its own instruction set called as Instruction Set Architecture. For executing any program on a particular machine the program must be converted into the ISA. In the final phase the optimized intermediate code has been converted into ISA. The ISA is then loaded into RAM and gets executed.

1 comment: