Getting Started with the Swift Frontend: Lexing & Parsing


A previous article in this series explained two primary ways of invoking the swift compiler executable: swift and swift -frontend.

  1. When invoking swift -frontend, the swift executable enters its main entry point and, once it sees the -frontend option, it begins to do everything you and I think of when we think of compilers: it attempts to lex the source file it's given, parse that file into a syntax tree, type-check it, produce an object file, and so on.
  2. When invoking just swift, without the -frontend option, the swift executable splits itself up into child invocations of swift -frontend. The logic that Swift uses to split itself up is in the libswiftDriver library.

Reading and Understanding the Swift Driver Source Code explains the libswiftDriver code that is executed in the second case. This article focuses on the first case: the "compiler-y" parts of the swift compiler executable.

In a nutshell, I aim to answer the question: "What happens when I compile this simple Swift program, hello.swift?"


// hello.swift

print("Hello, world!")

Parsing a Swift source file

First, I'll recap some details covered in previous articles. For example, I've explained that I can compile the hello.swift program on the command line, by invoking swiftc hello.swift. Even this simple invocation of swiftc, because it does not include the -frontend option, is split up into child jobs by the code in libswiftDriver. I can see these child jobs by invoking swiftc hello.swift -driver-print-jobs, which outputs something like the following:

swift -frontend \
    -c hello.swift \
    -o /tmp/hello.o
ld /tmp/hello.o \
    -lSystem -arch x86_64 -macosx_version_min 10.13.0 \
    -L /Users/bgesiak/Source/apple/build/Ninja-ReleaseAssert+swift-DebugAssert/swift-macosx-x86_64/lib/swift/macosx \
    -rpath /Users/bgesiak/Source/apple/build/Ninja-ReleaseAssert+swift-DebugAssert/swift-macosx-x86_64/lib/swift/macosx \
    -o hello

The first job invokes swift -frontend in order to produce an object file named hello.o, and the second job invokes the linker ld in order to link that object file into an executable named hello.

The first invocation appears to be very short, but make no mistake: it executes a lot of code, from a diverse set of libraries. These libraries include libswiftFrontend, libswiftParse, libswiftAST, libswiftSema, libswiftSIL, and more.

Covering each of these libraries in a single article would be exhausting. Instead, this article focuses on the first few phases of swift -frontend -c hello.swift. It'll cover libswiftFrontendTool, libswiftFrontend, and libswiftParse. These three libraries are used, in conjunction with libswiftAST, to build a tree structure that represents the untyped syntax tree of the Swift source file.

I can display the untyped syntax tree by invoking swiftc hello.swift -dump-parse, which outputs the following:

      (call_expr type='<null>' arg_labels=_:
        (unresolved_decl_ref_expr type='<null>' name=print function_ref=unapplied)
        (paren_expr type='<null>'
          (string_literal_expr type='<null>' encoding=utf8 value="Hello, world!" builtin_initializer=**NULL** initializer=**NULL**))))))

Note the difference between the untyped tree above and the typed syntax tree that swiftc -dump-ast produces:

      (call_expr type='()' location=hello.swift:3:1 range=[hello.swift:3:1 - line:3:22] nothrow arg_labels=_:
        (declref_expr type='(Any..., String, String) -> ()' location=hello.swift:3:1 range=[hello.swift:3:1 - line:3:1] decl=Swift.(file).print(_:separator:terminator:) function_ref=single)
        (tuple_shuffle_expr implicit type='(Any..., separator: String, terminator: String)' location=hello.swift:3:7 range=[hello.swift:3:6 - line:3:22] scalar_to_tuple elements=[-2, -1, -1] variadic_sources=[0] default_args_owner=Swift.(file).print(_:separator:terminator:)
          (paren_expr type='Any' location=hello.swift:3:7 range=[hello.swift:3:6 - line:3:22]
            (erasure_expr implicit type='Any' location=hello.swift:3:7 range=[hello.swift:3:7 - line:3:7]
              (string_literal_expr type='String' location=hello.swift:3:7 range=[hello.swift:3:7 - line:3:7] encoding=utf8 value="Hello, world!" builtin_initializer=Swift.(file).String.init(_builtinStringLiteral:utf8CodeUnitCount:isASCII:) initializer=**NULL**))))))))

Specifically, the untyped tree features nodes such as unresolved_decl_ref_expr, and many nodes also have a type='<null>' value. These are later filled in with type information by the type checker, which is implemented in libswiftSema. I'll cover libswiftSema in a future article.

The first six stages of the Swift frontend: lexing & parsing

Just how does the swift executable read the text in the hello.swift file to construct an untyped syntax tree? How does it determine that print("Hello, world!") is a call_expr that wraps a paren_expr that wraps a string_literal_expr?

At a high level, the frontend completes parsing in about 6 stages:

[Patron-only content: The six main stages of the Swift frontend and parsing logic.]

The rest of this article steps through the code behind these six steps, explaining them in more detail.

Stage 1: Parsing the -frontend argument

As explained in my libswiftDriver article, swift is just a C++ executable. Any invocation of a C++ executable begins in its main function. The swift executable's main function is defined in swift/tools/driver/driver.cpp.

Recall that one of the first things the Swift compiler's main function does is check for the first argument it's given. If that argument is -frontend, it calls the performFrontend function:


int main(int argc_, const char **argv_) {
  // ...
  StringRef FirstArg(argv[1]);
  if (FirstArg == "-frontend") {
    return performFrontend(llvm::makeArrayRef(/* ... */,
                                              /* ... */),
                           argv[0], (void *)(intptr_t)getExecutablePath);
  }
  // ...
}

Note that -frontend must be the first argument; invoking swift -c hello.swift -frontend is not considered a frontend invocation, and the performFrontend function will not be called. The machinery described in my previous article, Option Parsing in the Swift Compiler, hasn't been initialized yet, and so libswiftOption and libLLVMOption are not used here to check for swift::options::ID::OPT_frontend. Instead, this is a naive string comparison.
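
To make the "naive string comparison" concrete, here is a minimal sketch of the same check outside the compiler. The function name isFrontendInvocation is hypothetical and exists only for this illustration; the real check lives inline in main, as shown above.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of the Swift source): reports whether an
// argument list would be treated as a frontend invocation. Only argv[1] is
// inspected, so "-frontend" anywhere else in the list does not count.
bool isFrontendInvocation(const std::vector<std::string> &argv) {
  // argv[0] is the executable name, so at least two entries are required.
  return argv.size() > 1 && argv[1] == "-frontend";
}
```

Calling this with the arguments of swift -frontend -c hello.swift returns true, while the arguments of swift -c hello.swift -frontend return false, because the comparison never looks past the first argument.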

Stage 2: Instantiating a CompilerInstance based on command-line arguments

[Patron-only content: An explanation of the CompilerInstance and CompilerInvocation classes.]

2.1: Parsing frontend command-line arguments

If you've read Option Parsing in the Swift Compiler, the body of the CompilerInvocation::parseArgs member function should look very familiar to you. It parses arguments using the exact same libswiftOption and libLLVMOption abstractions as the Swift driver does: createSwiftOptTable and llvm::opt::OptTable::ParseArgs.


bool CompilerInvocation::parseArgs(ArrayRef<const char *> Args,
                                   DiagnosticEngine &Diags,
                                   StringRef workingDirectory) {
  // ...
  std::unique_ptr<llvm::opt::OptTable> Table = createSwiftOptTable();
  llvm::opt::InputArgList ParsedArgs =
      Table->ParseArgs(Args, MissingIndex, MissingCount, FrontendOption);
  // ...
  if (ParseFrontendArgs(FrontendOpts, ParsedArgs, Diags)) {
    return true;
  }

  if (ParseLangArgs(LangOpts, ParsedArgs, Diags, FrontendOpts)) {
    return true;
  }
  // ...
  return false;
}

After converting the command-line strings into an llvm::opt::InputArgList via the llvm::opt::OptTable::ParseArgs member function, the CompilerInvocation::parseArgs member function calls functions like ParseFrontendArgs, ParseLangArgs, and so on, in order to set values on members such as CompilerInvocation::LangOpts.

For example, ParseLangArgs is responsible for translating the swift::options::ID::OPT_swift_version stored on the llvm::opt::InputArgList, and using that to set CompilerInvocation::LangOpts::EffectiveLanguageVersion. If the version passed in is invalid, it emits an error diagnostic:


static bool ParseLangArgs(LangOptions &Opts, ArgList &Args,
                          DiagnosticEngine &Diags,
                          const FrontendOptions &FrontendOpts) {
  // ...
  if (auto A = Args.getLastArg(OPT_swift_version)) {
    auto vers = version::Version::parseVersionString(
      A->getValue(), SourceLoc(), &Diags);
    bool isValid = false;
    if (vers.hasValue()) {
      if (auto effectiveVers = vers.getValue().getEffectiveLanguageVersion()) {
        Opts.EffectiveLanguageVersion = effectiveVers.getValue();
        isValid = true;
      }
    }
    if (!isValid)
      diagnoseSwiftVersion(vers, A, Args, Diags);
  }
  // ...
}

To test this out, we can try passing swift -frontend an invalid language version, such as swift -frontend -c hello.swift -swift-version foo. This outputs:

<unknown>:0: error: version component contains non-numeric characters
<unknown>:0: error: invalid value 'foo' in '-swift-version foo'
<unknown>:0: note: valid arguments to '-swift-version' are '3', '4', '5'

If you're curious what other combinations of Swift language versions are valid, you can take a look at the libswiftBasic functions that the ParseLangArgs function uses above: Version::parseVersionString and Version::getEffectiveLanguageVersion.
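
The two-step shape of that logic (first parse the string into numeric components, then decide whether the result is an acceptable language version) can be sketched in a few lines. This is only an illustration under stated assumptions: parseVersion and isValidLanguageVersion are hypothetical names, and the rule "major version 3, 4, or 5" is taken from the note in the diagnostic output above rather than from the real Version::getEffectiveLanguageVersion, whose rules are more specific.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified version-string parser: splits on '.' and rejects
// empty or non-numeric components. Returns false on malformed input.
bool parseVersion(const std::string &text, std::vector<unsigned> &components) {
  components.clear();
  unsigned current = 0;
  bool haveDigits = false;
  for (std::size_t i = 0; i <= text.size(); ++i) {
    if (i == text.size() || text[i] == '.') {
      if (!haveDigits)
        return false; // "", "4." and ".4" are all malformed.
      components.push_back(current);
      current = 0;
      haveDigits = false;
    } else if (std::isdigit(static_cast<unsigned char>(text[i]))) {
      current = current * 10 + static_cast<unsigned>(text[i] - '0');
      haveDigits = true;
    } else {
      return false; // Non-numeric character, as in "foo".
    }
  }
  return true;
}

// Assumption for this sketch only: a version is acceptable when its major
// component is 3, 4, or 5.
bool isValidLanguageVersion(const std::string &text) {
  std::vector<unsigned> components;
  if (!parseVersion(text, components))
    return false;
  unsigned major = components.front();
  return major >= 3 && major <= 5;
}
```

With this sketch, "foo" fails in the parsing step, which corresponds to the first diagnostic above, and a well-formed but unsupported value such as "6" fails in the validity step.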

One of the most important parts of this argument parsing is done in the ParseFrontendArgs function. This calls through to the ArgsToFrontendOptionsConverter::determineRequestedAction member function, in order to set the CompilerInvocation object's FrontendOptions::RequestedAction, based on whether the frontend was invoked with -emit-object, -emit-sil, or some other option. This "requested action" will determine what logic the frontend executes:


FrontendOptions::ActionType
ArgsToFrontendOptionsConverter::determineRequestedAction() const {
  using namespace options;
  const Arg *A = Args.getLastArg(OPT_modes_Group);
  // ...
  Option Opt = A->getOption();
  if (Opt.matches(OPT_emit_object))
    return FrontendOptions::ActionType::EmitObject;
  if (Opt.matches(OPT_emit_assembly))
    return FrontendOptions::ActionType::EmitAssembly;
  if (Opt.matches(OPT_emit_ir))
    return FrontendOptions::ActionType::EmitIR;
  // ...
  if (Opt.matches(OPT_dump_parse))
    return FrontendOptions::ActionType::DumpParse;
  if (Opt.matches(OPT_dump_ast))
    return FrontendOptions::ActionType::DumpAST;
  // ...
  llvm_unreachable("Unhandled mode option");
}

My invocation of swift -frontend -c hello.swift does not appear to include the argument OPT_emit_object. However, a quick peek at the option definitions (covered in depth in the article on Option Parsing in the Swift Compiler) reveals that -c is an alias for -emit-object:


def c : Flag<["-"], "c">, Alias<emit_object>,
  Flags<[FrontendOption, NoInteractiveOption]>, ModeOpt;
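
The combination of "last mode option wins" and alias resolution can be illustrated with a small standalone sketch. Everything here is a simplified assumption: the real code matches llvm::opt::Option objects rather than strings, resolveAlias and the Unspecified case are hypothetical, and only the -c alias is taken from the definition above.

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative subset of action kinds; Unspecified is a placeholder for
// "no mode option was given" and is an assumption of this sketch.
enum class ActionType { Unspecified, EmitObject, EmitAssembly, EmitIR,
                        DumpParse, DumpAST };

// Hypothetical alias resolution: -c stands for -emit-object.
std::string resolveAlias(const std::string &flag) {
  return flag == "-c" ? std::string("-emit-object") : flag;
}

// Picks the action named by the last mode option in the argument list.
ActionType determineRequestedAction(const std::vector<std::string> &args) {
  static const std::map<std::string, ActionType> modes = {
      {"-emit-object", ActionType::EmitObject},
      {"-emit-assembly", ActionType::EmitAssembly},
      {"-emit-ir", ActionType::EmitIR},
      {"-dump-parse", ActionType::DumpParse},
      {"-dump-ast", ActionType::DumpAST},
  };
  ActionType action = ActionType::Unspecified;
  for (const std::string &arg : args) {
    auto found = modes.find(resolveAlias(arg));
    if (found != modes.end())
      action = found->second; // A later mode option replaces an earlier one.
  }
  return action;
}
```

Passing the arguments -c hello.swift to this sketch yields ActionType::EmitObject, which is the same conclusion the alias definition leads to for the real frontend.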

2.2: Instantiating the ASTContext via the CompilerInstance::setup member function

[Patron-only content: An explanation of the ASTContext class, and how it is initialized.]

Stage 3: Kicking off libswiftParse (and libswiftSema)

[Patron-only content: An explanation of the performCompile function.]

Stage 4: Opening a Swift.swiftmodule bitstream cursor and kicking off the parsing loop

[Patron-only content: An explanation of the CompilerInstance::performSema member function and the libswiftParseSIL compiler library.]

4.1: More on the Lexer and tokens

The Parser initializer that's used in the parseIntoSourceFile function creates a new Lexer object:


Parser::Parser(unsigned BufferID, SourceFile &SF, SILParserTUStateBase *SIL,
               PersistentParserState *PersistentState)
    : Parser(
          std::unique_ptr<Lexer>(new Lexer(
              SF.getASTContext().LangOpts, SF.getASTContext().SourceMgr,
              BufferID, &SF.getASTContext().Diags,
              /*InSILMode=*/SIL != nullptr,
              SF.getASTContext().LangOpts.AttachCommentsToDecls
                  ? CommentRetentionMode::AttachToNextToken
                  : CommentRetentionMode::None,
              SF.shouldKeepSyntaxInfo()
                  ? TriviaRetentionMode::WithTrivia
                  : TriviaRetentionMode::WithoutTrivia)),
          SF, SIL, PersistentState) {}

A Lexer is responsible for reading in the individual characters in a source file and forming logical chunks, called tokens.

The Clang compiler, which compiles C, C++, and Objective-C source code, can print the tokens it lexes, using the -dump-tokens frontend option. For example, consider the following simple C program hello.c:

int main() {
  return 0;
}

I can invoke the Clang frontend, clang -cc1, to dump the tokens in this file (Clang has a driver and frontend system that is nearly identical to Swift's, except that Clang takes the argument -cc1 instead of -frontend). clang -cc1 -dump-tokens hello.c outputs the following:

int 'int'             Loc=<hello.c:1:1>  [StartOfLine]
identifier 'main'     Loc=<hello.c:1:5>  [LeadingSpace]
l_paren '('           Loc=<hello.c:1:9>
r_paren ')'           Loc=<hello.c:1:10>
l_brace '{'           Loc=<hello.c:1:12> [LeadingSpace]
return 'return'       Loc=<hello.c:2:3>  [StartOfLine] [LeadingSpace]
numeric_constant '0'  Loc=<hello.c:2:10> [LeadingSpace]
semi ';'              Loc=<hello.c:2:11>
r_brace '}'           Loc=<hello.c:3:1>  [StartOfLine]
eof ''                Loc=<hello.c:3:2>

The Swift compiler executable does not have an option to print the tokens in a file (although if any readers are interested in contributing, this would be a great feature to add!), but if it did, the tokens in hello.swift would be output like this:

identifier 'print'                Loc=<hello.swift:3:1>  [StartOfLine]
l_paren '('                       Loc=<hello.swift:3:6>
string_literal '"Hello, world!"'  Loc=<hello.swift:3:7>
r_paren ')'                       Loc=<hello.swift:3:22>
eof ''                            Loc=<hello.swift:3:23>

Note that the comment at the top of the file, // hello.swift, and the empty line below that comment, are not represented as tokens. The Swift compiler can be invoked such that it creates tokens for comments and whitespace, but normally they are discarded entirely by the compiler.
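
To give a feel for what "forming tokens" involves, here is a toy lexer that produces the token listing above for hello.swift. It is a sketch under heavy assumptions, not how libswiftParse works: SketchToken and lex are hypothetical names, kinds are stored as strings rather than an enum, and only identifiers, parentheses, simple string literals (no escapes), whitespace, and // comments are handled.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token record for this sketch.
struct SketchToken {
  std::string kind; // "identifier", "l_paren", "string_literal", ...
  std::string text; // The characters covered by the token.
  unsigned line;    // 1-based line of the token's first character.
  unsigned column;  // 1-based column of the token's first character.
};

std::vector<SketchToken> lex(const std::string &source) {
  std::vector<SketchToken> tokens;
  unsigned line = 1, column = 1;
  std::size_t i = 0;
  while (i < source.size()) {
    char c = source[i];
    if (c == '\n') { ++line; column = 1; ++i; continue; }
    if (std::isspace(static_cast<unsigned char>(c))) { ++column; ++i; continue; }
    // Line comments are discarded up to, but not including, the newline.
    if (c == '/' && i + 1 < source.size() && source[i + 1] == '/') {
      while (i < source.size() && source[i] != '\n') { ++i; ++column; }
      continue;
    }
    std::size_t start = i;
    unsigned startColumn = column;
    std::string kind;
    if (std::isalpha(static_cast<unsigned char>(c)) || c == '_') {
      while (i < source.size() &&
             (std::isalnum(static_cast<unsigned char>(source[i])) ||
              source[i] == '_'))
        ++i;
      kind = "identifier";
    } else if (c == '"') {
      ++i; // Opening quote.
      while (i < source.size() && source[i] != '"' && source[i] != '\n') ++i;
      if (i < source.size() && source[i] == '"') ++i; // Closing quote.
      kind = "string_literal";
    } else if (c == '(') {
      ++i; kind = "l_paren";
    } else if (c == ')') {
      ++i; kind = "r_paren";
    } else {
      ++i; kind = "unknown";
    }
    column += static_cast<unsigned>(i - start);
    tokens.push_back({kind, source.substr(start, i - start), line, startColumn});
  }
  tokens.push_back({"eof", "", line, column});
  return tokens;
}
```

Running lex on the three-line hello.swift (comment, empty line, print call, no trailing newline) produces five tokens whose kinds and locations match the listing above, with the comment and blank line contributing nothing but line-number bookkeeping.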

The first column of the output above displays the token "kind". Kinds of Swift tokens include identifier, l_paren, and eof. You can find a list of all the different Swift token kinds in swift/include/swift/Syntax/TokenKinds.def. An enum of all the different token kinds is defined in swift/include/swift/Syntax/TokenKinds.h, using a trick readers of my option parsing article should be familiar with: it defines the TOKEN macro, and then includes the TokenKinds.def file, which contains a call to TOKEN for each token kind.


enum class tok {
#define TOKEN(X) X,
#include "swift/Syntax/TokenKinds.def"
  // ...
};

This creates an enum case for each token kind. For example, the enum case for the if keyword is named tok::kw_if:


/// KEYWORD(kw)
///   Expands by default for every Swift keyword and every SIL keyword, such as
///   'if', 'else', 'sil_global', etc. If you only want to use Swift keywords
///   see SWIFT_KEYWORD.
#ifndef KEYWORD
#define KEYWORD(kw) TOKEN(kw_ ## kw)
#endif

/// SWIFT_KEYWORD(kw)
///   Expands for every Swift keyword.
#ifndef SWIFT_KEYWORD
#define SWIFT_KEYWORD(kw) KEYWORD(kw)
#endif

// ...

/// STMT_KEYWORD(kw)
///   Expands for every Swift keyword used in statement grammar.
#ifndef STMT_KEYWORD
#define STMT_KEYWORD(kw) SWIFT_KEYWORD(kw)
#endif

The Token class stores information about a token: its kind and its text. It also defines member functions such as Token::is, so that other parts of the compiler can quickly check "is this token an if keyword?" by comparing the token's kind against tok::kw_if. Or, to check simply that the token is a keyword, there is the Token::isKeyword member function, which is implemented using a macro and an include:


class Token {
  /// Kind - The actual flavor of token this is.
  ///
  tok Kind;
  // ...
  /// Text - The actual string covered by the token in the source buffer.
  StringRef Text;
  // ...
public:
  // ...
  /// is/isNot - Predicates to check if this token is a specific kind, as in
  /// "if ( {...}".
  bool is(tok K) const { return Kind == K; }
  // ...
  /// True if the token is any keyword.
  bool isKeyword() const {
    switch (Kind) {
#define KEYWORD(X) case tok::kw_##X: return true;
#include "swift/Syntax/TokenKinds.def"
    default: return false;
    }
  }
  // ...
};
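
The macro-and-include trick used for both the tok enum and Token::isKeyword is often called an "X-macro". The sketch below shows the idea in a single file. Since a standalone example cannot include TokenKinds.def, it swaps in a list macro, SKETCH_TOKEN_KINDS, that plays the role of the .def file; that macro, the SKETCH_ helper macros, and the tokenName function are all hypothetical and cover only a handful of token kinds.

```cpp
#include <string>

// Stand-in for a .def file: invokes TOKEN or KEYWORD once per token kind.
#define SKETCH_TOKEN_KINDS(TOKEN, KEYWORD) \
  TOKEN(identifier)                        \
  TOKEN(l_paren)                           \
  TOKEN(r_paren)                           \
  TOKEN(eof)                               \
  KEYWORD(if)                              \
  KEYWORD(else)

// First expansion: one enum case per entry; keywords get a kw_ prefix.
#define SKETCH_ENUM_TOKEN(X) X,
#define SKETCH_ENUM_KEYWORD(X) kw_##X,
enum class tok { SKETCH_TOKEN_KINDS(SKETCH_ENUM_TOKEN, SKETCH_ENUM_KEYWORD) };

// Second expansion: a switch case per keyword; plain tokens expand to nothing.
#define SKETCH_IGNORE(X)
#define SKETCH_KEYWORD_CASE(X) case tok::kw_##X: return true;
bool isKeyword(tok kind) {
  switch (kind) {
    SKETCH_TOKEN_KINDS(SKETCH_IGNORE, SKETCH_KEYWORD_CASE)
  default:
    return false;
  }
}

// Third expansion: the same list yields printable names for free.
#define SKETCH_NAME_TOKEN(X) case tok::X: return #X;
#define SKETCH_NAME_KEYWORD(X) case tok::kw_##X: return "kw_" #X;
const char *tokenName(tok kind) {
  switch (kind) {
    SKETCH_TOKEN_KINDS(SKETCH_NAME_TOKEN, SKETCH_NAME_KEYWORD)
  default:
    return "unknown";
  }
}
```

The appeal of the pattern is that the list of token kinds is written once. Adding a keyword to the list updates the enum, isKeyword, and tokenName together, which is the same property the Swift source gets from re-including TokenKinds.def with different macro definitions.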

4.2: Priming the Lexer to form a token for the identifier "print"

[Patron-only content: An explanation of how the lexer is primed, and the internals of how it lexes a token.]

Stage 5: The lex & parse loop

[Patron-only content: An explanation of the main loop run by the parser.]

5.1: A quick note on custom allocators in C++

Before getting into the Parser::parseExprOrStmt member function, take a closer look at the instantiation of TopLevelCodeDecl, which uses an interesting C++ feature:


ParserStatus Parser::parseBraceItems(SmallVectorImpl<ASTNode> &Entries,
                                     BraceItemListKind Kind,
                                     BraceItemListKind ConditionalBlockKind) {
  // ...
      auto *TLCD = new (Context) TopLevelCodeDecl(CurDeclContext);
  // ...
}

I didn't have any experience writing C++ before I began working on Swift and Clang, so this call to new (Context) TopLevelCodeDecl(...) confused me. I was familiar with expressions such as new Foo(), which allocate memory for an instance of Foo. But the expression new (Context) TopLevelCodeDecl(...) seemed to have an extra element: what is (Context) here?

It turns out that C++ allows you to provide overloads of the new operator for specific classes, and those overloads can take additional parameters. The new operator's first parameter must be a size_t that indicates how many bytes should be allocated, but beyond that you can define an arbitrary list of parameters. Here, Context is an argument being passed to new.

Swift's Decl class not only defines a custom new operator that takes an ASTContext argument, it also deletes the default new operator implementation:


/// Decl - Base class for all declarations in Swift.
class alignas(1 << DeclAlignInBits) Decl {
protected:
  // ...
  // Make vanilla new/delete illegal for Decls.
  void *operator new(size_t Bytes) = delete;
  // ...
  // Only allow allocation of Decls using the allocator in ASTContext
  // or by doing a placement new.
  void *operator new(size_t Bytes, const ASTContext &C,
                     unsigned Alignment = alignof(Decl));
  // ...
};

This means that you cannot allocate memory for a Decl by calling new Decl() – you must call new (Context) Decl(). Doing so calls the ASTContext::Allocate member function:


// Only allow allocation of Decls using the allocator in ASTContext.
void *Decl::operator new(size_t Bytes, const ASTContext &C,
                         unsigned Alignment) {
  return C.Allocate(Bytes, Alignment);
}
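
The same pattern can be reproduced in a self-contained sketch. SketchContext and SketchDecl are hypothetical stand-ins for ASTContext and Decl. The context here simply tracks malloc'd blocks and frees them when it is destroyed, which is an assumption made for brevity and says nothing about how ASTContext::Allocate is actually implemented; the alignment parameter is also omitted.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical stand-in for ASTContext: owns every allocation made through
// it and releases them all at once on destruction.
class SketchContext {
  mutable std::vector<void *> Allocations;

public:
  SketchContext() = default;
  SketchContext(const SketchContext &) = delete;
  SketchContext &operator=(const SketchContext &) = delete;
  ~SketchContext() {
    for (void *p : Allocations)
      std::free(p);
  }

  void *Allocate(std::size_t Bytes) const {
    void *p = std::malloc(Bytes);
    if (!p)
      throw std::bad_alloc();
    Allocations.push_back(p);
    return p;
  }

  std::size_t numAllocations() const { return Allocations.size(); }
};

// Hypothetical stand-in for Decl. It is trivially destructible, so it is
// fine that its destructor never runs when the context frees the memory.
class SketchDecl {
public:
  int Value;
  explicit SketchDecl(int V) : Value(V) {}

  // `new SketchDecl(1)` no longer compiles.
  void *operator new(std::size_t Bytes) = delete;

  // `new (Ctx) SketchDecl(1)` routes the allocation through the context.
  void *operator new(std::size_t Bytes, const SketchContext &Ctx) {
    return Ctx.Allocate(Bytes);
  }

  // Matching placement delete, used only if the constructor throws; the
  // context still owns the memory, so there is nothing to do here.
  void operator delete(void *, const SketchContext &) noexcept {}
};
```

Given a SketchContext named Ctx, the expression new (Ctx) SketchDecl(42) returns a pointer to an object whose storage belongs to Ctx. There is no matching delete call at the use site: the object's lifetime is tied to the context, which is the design choice this allocation style encodes.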

You may also have noticed this syntax earlier in this article, when CompilerInstance::performSema called through to a function that created the SourceFile root node in the AST. The SourceFile class inherits from DeclContext, which declares its own overload of the new operator:


class alignas(1 << DeclContextAlignInBits) DeclContext {
  // ...
  // Only allow allocation of DeclContext using the allocator in ASTContext.
  void *operator new(size_t Bytes, ASTContext &C,
                     unsigned Alignment = alignof(DeclContext));
  // ...
};

This also calls through to the ASTContext::Allocate function:


// Only allow allocation of DeclContext using the allocator in ASTContext.
void *DeclContext::operator new(size_t Bytes, ASTContext &C,
                                unsigned Alignment) {
  return C.Allocate(Bytes, Alignment);
}

The ASTContext::Allocate member function is too interesting to explain in detail here. I'll write about it in a future article.

5.2: Parsing the print(...) expression

[Patron-only content: An explanation of how the parser creates AST nodes from the print(...) expression.]

Stage 6: Reaching the end of the file

When Parser::parseExprList is called, the Parser tasks the Lexer with consuming each token within the parentheses that come after print. The parsing and lexing stop at ')', at which point Lexer::NextToken is set to tok::eof – the end of the file.

Recall that all this parsing was occurring within a while loop in Parser::parseBraceItems. One of the termination conditions for that while loop was encountering an EOF marker. Now that the end of the file has been reached, Parser::parseBraceItems returns control back to Parser::parseTopLevel, which returns control back to parseIntoSourceFile, which returns back to CompilerInstance::parseAndTypeCheckMainFile. The stage is now set for type-checking, which is done by calling the performTypeChecking function:


void CompilerInstance::parseAndTypeCheckMainFile(
    PersistentParserState &PersistentState,
    DelayedParsingCallbacks *DelayedParseCB,
    OptionSet<TypeCheckingFlags> TypeCheckOptions) {
  // ...
  bool Done;
  do {
    // ...
    parseIntoSourceFile(MainFile, MainFile.getBufferID().getValue(), &Done,
                        TheSILModule ? &SILContext : nullptr, &PersistentState,
                        DelayedParseCB);
    // ...
      performTypeChecking(MainFile, PersistentState.getTopLevelContext(),
                          TypeCheckOptions, CurTUElem,
                          options.WarnLongFunctionBodies,
                          options.WarnLongExpressionTypeChecking,
                          options.SolverExpressionTimeThreshold);
    // ...
  } while (!Done);
  // ...
}
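
The control flow of that do/while loop (parse, then type-check, repeating until a Done flag is set) can be sketched in isolation. This is a loose illustration that assumes each parse call may consume only part of the input; SketchParser, parseInto, and parseAndTypeCheck are hypothetical, the "items" are pre-split strings rather than real syntax, and "type-checking" is reduced to recording how many items exist at that point.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical parser over pre-split top-level items.
class SketchParser {
  std::vector<std::string> Items;
  std::size_t Next = 0;

public:
  explicit SketchParser(std::vector<std::string> items)
      : Items(std::move(items)) {}

  // Parses at most one item into `parsed` and sets *done once the input is
  // exhausted, mirroring the out-parameter style of the excerpt above.
  void parseInto(std::vector<std::string> &parsed, bool *done) {
    if (Next < Items.size())
      parsed.push_back(Items[Next++]);
    *done = Next >= Items.size();
  }
};

// Alternates parsing and a stand-in "type-checking" pass until the parser
// reports that it is done. Returns the number of parsed items seen by each
// pass.
std::vector<std::size_t> parseAndTypeCheck(SketchParser &parser) {
  std::vector<std::string> parsed;
  std::vector<std::size_t> passes;
  bool done;
  do {
    parser.parseInto(parsed, &done);
    passes.push_back(parsed.size()); // Stand-in for type-checking.
  } while (!done);
  return passes;
}
```

Because the loop is a do/while, the body runs at least once even for an empty input, so a checking pass always happens; for three items the sketch performs three passes, each seeing one more parsed item than the last.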

I'll cover type-checking in a future article.

Recap: lexing and parsing in the Swift compiler

This article covered a lot of ground:

[Patron-only content: A summary of the article.]

To add new pieces of Swift syntax, or to modify existing Swift syntax, it's helpful to understand the libswiftParse source code. For example:

To learn more about how parsing works in the Swift compiler, try writing small programs and walking through the code in the compiler to see how they're parsed. A good way to do this is by attaching a debugger – read the instructions from the first article in this series, Getting Started with Swift Compiler Development, to learn how to do so. There's still a lot left to learn that I didn't cover in this article. Here are just two examples of parser mechanics that this article didn't cover:

I hope this article has helped you learn about parsing in the Swift compiler. If you enjoyed this article and would like to read more like it, please consider supporting me on Patreon. I wouldn't be able to write these articles were it not for the support I receive.