May 7, 2018

The prior article in this series explained how the Swift and Clang compilers used llvm::SourceMgr to emit diagnostics for source locations in memory buffers, represented by the class llvm::MemoryBuffer. This article focuses on llvm::MemoryBuffer, the primary abstraction for reading files and streams into memory. Since it's used by Swift, Clang, and LLVM tools like llvm-tblgen, I found it valuable to understand how it works.

Reading a file into memory using C++

The documentation for libLLVMSupport's llvm::MemoryBuffer class says it "provides simple read-only access to a block of memory, and provides simple methods for reading files and standard input into a memory buffer." To better understand how it does that, I tried writing a simple C++ program, called read.cpp, that reads a file – itself, in this case – into memory. For simplicity's sake my program is only meant to operate on Unix systems.

My read.cpp program reads a file into memory by using various system calls. These are requests made to the operating system for things like "open a file and give me its file descriptor," or "read 8 bytes from the file with this file descriptor." Julia Evans has a wonderful comic that explains them further:

Your program doesn't know how to, for example, open a file on the filesystem, but the Linux operating system does. Your program can ask the operating system to do this via a "system call."

My read.cpp program uses four system calls:

open(2) to get a file descriptor for the file.
fstat(2), which returns information about the file behind a file descriptor. read.cpp uses it to find the file's size, so it knows how much memory to allocate.
read(2), which reads a given number of bytes from a file into a pre-allocated block of memory.
close(2) to close a file descriptor once I'm done using it.

Once the read.cpp program allocates memory and reads its own source file into that memory, it walks a char * pointer through the memory, printing each character until it reaches the end of the first line of the file:

read.cpp

 1  #include <cerrno>
 2  #include <iostream>
 3  #include <system_error>
 4  
 5  #include <fcntl.h>
 6  #include <sys/stat.h>
 7  #include <unistd.h>
 8  
 9  int main() {
10    // I'll open this file itself and read it into memory.
11    auto FileName = __FILE__;
12  
13    // The system call open(2) gets a file descriptor
14    // representing the open file.
15    int OpenFlags = O_RDONLY;
16    int FD = open(FileName, OpenFlags);
17  
18    // open(2) returns a -1 if the file could not be opened.
19    // In this case, print an error and return.
20    if (FD < 0) {
21      std::error_code Err(errno, std::generic_category());
22      std::cerr << "[ERROR] Could not open file \""
23                << FileName << "\": " << Err.message()
24                << std::endl;
25      return 1;
26    }
27  
28    // Syscall fstat populates the struct stat pointer
29    // with information about the given file descriptor,
30    // including the file's size in bytes.
31    struct stat Stat;
32    if (fstat(FD, &Stat) < 0) {
33      std::error_code Err(errno, std::generic_category());
34      std::cerr << "[ERROR] Could not acquire information "
35                << "on file descriptor \"" << FD
36                << "\": " << Err.message() << std::endl;
37      return 1;
38    }
39  
40    off_t FileSize = Stat.st_size;
41    std::cout << "[NOTE] File size: " << FileSize << " bytes"
42              << std::endl;
43  
44    // Allocate memory in size equal to the number of bytes
45    // in the file.
46    char *Memory = static_cast<char *>(operator new(
47        FileSize + 1, std::nothrow));
48    Memory[FileSize] = 0;
49  
50    // Use syscall read(2) to read in bytes from the given
51    // file descriptor, into the prepared buffer, 16 bytes
52    // at a time.
53    const ssize_t ChunkSize = 16;
54    ssize_t Offset = 0;
55    ssize_t ReadBytes = 0;
56    do {
57      ReadBytes = read(FD, Memory + Offset, ChunkSize);
58      if (ReadBytes < 0) {
59        std::error_code Err(errno, std::generic_category());
60        std::cerr << "[ERROR] Could not read from file "
61                     "descriptor \""
62                  << FD << "\": " << Err.message()
63                  << std::endl;
 64        operator delete(Memory);
65        return 1;
66      }
67      Offset += ReadBytes;
68    } while (ReadBytes != 0);
69  
70    // I've now read the file into memory. To demonstrate:
71    std::cout << "[NOTE] Here's the first line "
72              << "of the file: \"";
73    char *Ptr = Memory;
74    while (*Ptr != '\n' && *Ptr != '\0') {
75      std::cout << *Ptr;
76      ++Ptr;
77    }
78    std::cout << "\"" << std::endl;
79  
80    // Once I'm done with the file, I need to delete the
81    // memory I allocated, otherwise this is a memory leak.
 82    operator delete(Memory);
83  
84    // Finally, I need to close the open file descriptor,
85    // using the system call close(2).
86    if (close(FD) < 0) {
87      std::error_code Err(errno, std::generic_category());
88      std::cerr << "[ERROR] Could not close file "
 89                << "descriptor \"" << FD << "\": "
90                << Err.message() << std::endl;
91      return 1;
92    }
93  
94    return 0;
95  }

I can compile and run this program like so:

clang++ read.cpp -o my-read-example
./my-read-example
[NOTE] File size: 2820 bytes
[NOTE] Here's the first line of the file: "#include <cerrno>"

This is a good initial implementation of reading a file into memory in C++. In fact, this is very similar to what the llvm::MemoryBuffer::getFile function does. However, there's room for improvement.
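
One small improvement can be sketched without leaving plain POSIX: ask read(2) only for the bytes that are still missing, so the kernel can never write past the end of the buffer, and let std::string own the memory so that no error path has to free it by hand. The helper names readAll and readWholeFile are made up for this illustration; they are not part of read.cpp or LLVM.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <system_error>

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: read up to Size bytes from FD, stopping early if the
// file ends first. Throws std::system_error if read(2) fails.
std::string readAll(int FD, size_t Size) {
  std::string Buffer(Size, '\0');
  size_t Offset = 0;
  while (Offset < Size) {
    // Ask only for what is still missing, so read(2) cannot overrun Buffer.
    ssize_t N = read(FD, &Buffer[Offset], Size - Offset);
    if (N < 0) {
      if (errno == EINTR)
        continue; // Interrupted by a signal: try again.
      throw std::system_error(errno, std::generic_category(), "read");
    }
    if (N == 0)
      break; // End of file reached earlier than fstat predicted.
    Offset += static_cast<size_t>(N);
  }
  assert(Offset <= Size);
  Buffer.resize(Offset);
  return Buffer;
}

// Hypothetical helper: open, size, read, and close in one call.
std::string readWholeFile(const char *Path) {
  int FD = open(Path, O_RDONLY);
  if (FD < 0)
    throw std::system_error(errno, std::generic_category(), "open");
  struct stat Stat;
  if (fstat(FD, &Stat) < 0) {
    int Saved = errno;
    close(FD);
    throw std::system_error(Saved, std::generic_category(), "fstat");
  }
  std::string Contents;
  try {
    Contents = readAll(FD, static_cast<size_t>(Stat.st_size));
  } catch (...) {
    close(FD); // Don't leak the descriptor on the error path.
    throw;
  }
  close(FD);
  return Contents;
}
```

Calling readWholeFile(__FILE__) from main would return the same bytes that read.cpp prints from, with the descriptor closed on every path.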

Reading a large file into memory using `mmap(2)`

Recall that we allocated memory on the heap using operator new, and then used the syscall read(2) to populate that memory with the contents of our file:

read.cpp

46    char *Memory = static_cast<char *>(operator new(
47        FileSize + 1, std::nothrow));
48    Memory[FileSize] = 0;
..  
56    do {
57      ReadBytes = read(FD, Memory + Offset, ChunkSize);
..  
67      Offset += ReadBytes;
68    } while (ReadBytes != 0);

This allocation would be problematic if we had a huge file to read into memory. A file with a size of 1 gigabyte would result in 1 gigabyte of memory being allocated – that's a lot of RAM!

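Thankfully, the syscall mmap(2) offers an alternative. Instead of copying the whole file into memory up front, it maps the file into the process's address space, and the operating system loads pages of the file only when they're first accessed. Once again, Julia Evans explains it best with another great comic: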

The mmap(2) syscall lazily loads files into memory.

I can modify the read.cpp program to use mmap(2) when reading from large files:

read.cpp

  5  #include <fcntl.h>
  +  #include <sys/mman.h>
  7  #include <sys/stat.h>
  8  #include <unistd.h>
  9  
 10  int main() {
 ..  
 ++    // For "large" files over 1024 bytes in size, I'll use
 ++    // syscall mmap(2).
 ++    char *Memory = nullptr;
 ++    bool UseMMap = (FileSize > 1024);
 ++    if (UseMMap) {
 ++      std::cout << "[NOTE] Using mmap" << std::endl;
 ++      int ProtectedOptions = PROT_READ;
 ++      int Flags = MAP_SHARED;
 ++      Memory = static_cast<char *>(mmap(nullptr, FileSize,
 ++                                        ProtectedOptions,
 ++                                        Flags, FD, 0));
 ++      if (Memory == MAP_FAILED) {
 ++        std::error_code Err(errno, std::generic_category());
 ++        std::cerr
 ++            << "[ERROR] Could not mmap file descriptor \""
 ++            << FD << "\": " << Err.message() << std::endl;
 ++        return 1;
 ++      }
 ++    } else {
 ..      // ...use operator new as before.
 89    }
 90  
 91    // I've now read the file into memory.
 ++    // Note that this works exactly as before, we
 ++    // don't have to worry about whether it's an mmap:
 94    std::cout << "[NOTE] Here's the first line "
 95              << "of the file: \"";
 96    char *Ptr = Memory;
 97    while (*Ptr != '\n' && *Ptr != '\0') {
 98      std::cout << *Ptr;
 99      ++Ptr;
100    }
101    std::cout << "\"" << std::endl;
102  
+++    if (UseMMap) {
+++      // Once I'm done with the mmap'ed region, I need to
+++      // release it.
+++      munmap(Memory, FileSize);
+++    } else {
108      // Once I'm done with the file, I need to delete the
109      // memory I allocated, otherwise this is a memory leak.
110      operator delete(Memory);
+++    }
...  
123    return 0;
124  }

Compiling and running this program produces the exact same results as before, with the important distinction that this program can open even very large files, without allocating a ton of memory.

To experiment, you could try adding millions of lines of comments to the bottom of read.cpp. Flip the (FileSize > 1024); conditional to < in order to use operator new, and you'll allocate hundreds of megabytes of memory up front. Then flip it back, to use mmap(2), and you'll allocate almost no memory.
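
The map-then-unmap pairing can also be packaged so that the cleanup can't be forgotten. The sketch below is one possible shape for that, assuming a Unix system: MappedFile and firstLineLength are names invented for this example, and the class maps the whole file read-only, like the modified read.cpp does.

```cpp
#include <cassert>
#include <cstddef>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical RAII wrapper: maps a whole file read-only and unmaps it when
// the object goes out of scope.
class MappedFile {
  char *Start = nullptr;
  std::size_t Length = 0;

public:
  explicit MappedFile(const char *Path) {
    int FD = open(Path, O_RDONLY);
    if (FD < 0)
      return;
    struct stat Stat;
    if (fstat(FD, &Stat) == 0 && Stat.st_size > 0) {
      std::size_t Size = static_cast<std::size_t>(Stat.st_size);
      void *Mem = mmap(nullptr, Size, PROT_READ, MAP_SHARED, FD, 0);
      if (Mem != MAP_FAILED) {
        Start = static_cast<char *>(Mem);
        Length = Size;
      }
    }
    // The mapping stays valid after the descriptor is closed.
    close(FD);
  }

  ~MappedFile() {
    if (Start)
      munmap(Start, Length);
  }

  MappedFile(const MappedFile &) = delete;
  MappedFile &operator=(const MappedFile &) = delete;

  bool isValid() const { return Start != nullptr; }
  const char *data() const { return Start; }
  std::size_t size() const { return Length; }
};

// Hypothetical helper: length of the first line. It is bounded by size()
// rather than relying on a '\0' terminator, since a mapping whose length is
// an exact multiple of the page size has no zero byte after it.
std::size_t firstLineLength(const MappedFile &File) {
  assert(File.isValid());
  std::size_t N = 0;
  while (N < File.size() && File.data()[N] != '\n')
    ++N;
  return N;
}
```

A caller would construct MappedFile File("read.cpp"), check File.isValid(), and then read through File.data() just as it would through a heap-allocated buffer.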

For the most part, llvm::MemoryBuffer works exactly the same way as the read.cpp program above. It has a few extra bells and whistles, too: it works on both Unix and Windows, it uses a more complex heuristic to decide whether to use mmap(2) or not, and it uses some interesting syscalls and options on platforms that support them. I'll explain these as I write about it in detail below.

The LLVM implementation of `read.cpp`: `llvm::MemoryBuffer::getFileOrSTDIN`

Swift and Clang both use the llvm::MemoryBuffer::getFileOrSTDIN static member function to open input file arguments passed to them on the command-line. For example, below is the code in libswiftFrontend that converts the string filenames it was passed on the command-line into llvm::MemoryBuffer objects. The filename is a std::string stored as swift::InputFile::file.

swift/lib/Frontend/Frontend.cpp

315  std::pair<std::unique_ptr<llvm::MemoryBuffer>,
316            std::unique_ptr<llvm::MemoryBuffer>>
317  CompilerInstance::getInputBufferAndModuleDocBufferIfPresent(
318      const InputFile &input) {
...  
326    using FileOrError = llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>>;
327    FileOrError inputFileOrErr = llvm::MemoryBuffer::getFileOrSTDIN(input.file());
328    if (!inputFileOrErr) {
329      Diagnostics.diagnose(SourceLoc(), diag::error_open_input_file, input.file(),
330                           inputFileOrErr.getError().message());
331      return std::make_pair(nullptr, nullptr);
332    }
...  
342  }

As I wrote in the previous article, these llvm::MemoryBuffer objects will then be passed over to the llvm::SourceMgr, which takes ownership of them. The swift::Parser will then interact with llvm::SourceMgr (or more precisely, a wrapper called swift::SourceManager) in order to emit diagnostics at particular locations in the buffer.

The llvm::MemoryBuffer::getFileOrSTDIN function returns either a std::unique_ptr to an llvm::MemoryBuffer for the given file, or an error. This is represented by the llvm::ErrorOr type. (I'll write more about llvm::ErrorOr in the future, but in the meantime you can watch this 5-minute lightning talk from LLVM Developers Meeting 2016 to learn more about it.)

The getFileOrSTDIN function just checks for a file name of "-" and then delegates its logic to either llvm::MemoryBuffer::getSTDIN or getFile. It may optionally be given an int64_t FileSize argument, but if not the default value of -1 signals the function to find out on its own – just as my example read.cpp program above did, by using the fstat system call.

llvm/include/llvm/Support/MemoryBuffer.h

125    /// Open the specified file as a MemoryBuffer, or open stdin if the Filename
126    /// is "-".
127    static ErrorOr<std::unique_ptr<MemoryBuffer>>
128    getFileOrSTDIN(const Twine &Filename, int64_t FileSize = -1,
129                   bool RequiresNullTerminator = true);

llvm/lib/Support/MemoryBuffer.cpp

 143  ErrorOr<std::unique_ptr<MemoryBuffer>>
 144  MemoryBuffer::getFileOrSTDIN(const Twine &Filename, int64_t FileSize,
 145                               bool RequiresNullTerminator) {
 146    SmallString<256> NameBuf;
 147    StringRef NameRef = Filename.toStringRef(NameBuf);
 148  
 149    if (NameRef == "-")
 150      return getSTDIN();
 151    return getFile(Filename, FileSize, RequiresNullTerminator);
 152  }
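
The "-" convention is easy to reproduce outside of LLVM. Here's a minimal sketch using only the C++ standard library; readStream and readFileOrStdin are hypothetical names, and the function returns a std::string and throws on failure, whereas the real function returns an llvm::ErrorOr wrapping an llvm::MemoryBuffer.

```cpp
#include <cassert>
#include <fstream>
#include <iostream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: slurp everything that is left in a stream.
std::string readStream(std::istream &In) {
  return std::string(std::istreambuf_iterator<char>(In),
                     std::istreambuf_iterator<char>());
}

// Hypothetical stand-in for getFileOrSTDIN: "-" means standard input, and
// anything else is treated as a path on disk. The Stdin parameter exists only
// so that the standard input stream can be swapped out, for example in tests.
std::string readFileOrStdin(const std::string &Name,
                            std::istream &Stdin = std::cin) {
  assert(!Name.empty() && "expected a file name or \"-\"");
  if (Name == "-")
    return readStream(Stdin);

  std::ifstream File(Name, std::ios::binary);
  if (!File)
    throw std::runtime_error("could not open " + Name);
  return readStream(File);
}
```

With this in place, a tool that is invoked as `tool -` reads whatever is piped into it, while `tool input.txt` reads the named file, and the rest of the program doesn't need to know which one happened.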

I'll focus on the getFile case for now, which delegates in turn to a function called getFileAux. The getFileAux static function implements some of the logic I implemented in the read.cpp example above: it opens the file in order to obtain a file descriptor, it reads that file, and then it calls close(2) in order to close the file descriptor:

llvm/include/llvm/Support/MemoryBuffer.h

73    /// Open the specified file as a MemoryBuffer, returning a new MemoryBuffer
74    /// if successful, otherwise returning null. If FileSize is specified, this
75    /// means that the client knows that the file exists and that it has the
76    /// specified size.
77    ///
78    /// \param IsVolatile Set to true to indicate that the contents of the file
79    /// can change outside the user's control, e.g. when libclang tries to parse
80    /// while the user is editing/updating the file or if the file is on an NFS.
81    static ErrorOr<std::unique_ptr<MemoryBuffer>>
82    getFile(const Twine &Filename, int64_t FileSize = -1,
83            bool RequiresNullTerminator = true, bool IsVolatile = false);

llvm/lib/Support/MemoryBuffer.cpp

229  ErrorOr<std::unique_ptr<MemoryBuffer>>
230  MemoryBuffer::getFile(const Twine &Filename, int64_t FileSize,
231                        bool RequiresNullTerminator, bool IsVolatile) {
232    return getFileAux<MemoryBuffer>(Filename, FileSize, FileSize, 0,
233                                    RequiresNullTerminator, IsVolatile);
234  }
...  
242  template <typename MB>
243  static ErrorOr<std::unique_ptr<MB>>
244  getFileAux(const Twine &Filename, int64_t FileSize, uint64_t MapSize,
245             uint64_t Offset, bool RequiresNullTerminator, bool IsVolatile) {
246    int FD;
247    std::error_code EC = sys::fs::openFileForRead(Filename, FD);
248  
249    if (EC)
250      return EC;
251  
252    auto Ret = getOpenFileImpl<MB>(FD, Filename, FileSize, MapSize, Offset,
253                                   RequiresNullTerminator, IsVolatile);
254    close(FD);
255    return Ret;
256  }

Unlike read.cpp, the getFileAux function does not call the open(2) system call directly in order to obtain an open file descriptor for the given filename. Instead, it uses the llvm::sys::fs::openFileForRead function. This LLVM helper function, unlike open(2), works on both Windows and Unix platforms.

Per-platform implementations of system calls in LLVM

The llvm::sys::fs::openFileForRead function has a single declaration, in the header file FileSystem.h:

llvm/include/llvm/Support/FileSystem.h

...  /// @brief Opens the file with the given name in a read-only mode, returning
...  /// its open file descriptor.
...  ///
...  /// @param Name The name of the file to open.
...  /// @param ResultFD The location to store the descriptor for the opened file.
...  /// @param RealPath If nonnull, extra work is done to determine the real path
...  ///                 of the opened file, and that path is stored in this
...  ///                 location.
...  /// @returns errc::success if \a Name has been opened, otherwise a
...  ///          platform-specific error_code.
822  std::error_code openFileForRead(const Twine &Name, int &ResultFD,
823                                  SmallVectorImpl<char> *RealPath = nullptr);

But the LLVM codebase defines two separate implementations of this function: one that's used on Windows platforms, and another that's used on Unix. It accomplishes this using CMake.

I've found that a working knowledge of CMake is a gift that really keeps on giving when it comes to compiler development. If you haven't already, you can read about it more in my articles The Swift Compiler's Build System and Reading and Understanding the CMake in apple/swift.

LLVM's root CMakeLists.txt file appends two directories to its modules path, and then includes one file from each of those directories: llvm/cmake/config-ix.cmake and llvm/cmake/modules/HandleLLVMOptions.cmake. Finally, it uses a template file named config.h.cmake to generate a header file named config.h:

llvm/CMakeLists.txt

184  set(CMAKE_MODULE_PATH
185    ${CMAKE_MODULE_PATH}
186    "${CMAKE_CURRENT_SOURCE_DIR}/cmake"
187    "${CMAKE_CURRENT_SOURCE_DIR}/cmake/modules"
188    )
...  
588  include(config-ix)
...  
602  include(HandleLLVMOptions)
...  
737  configure_file(
738    ${LLVM_MAIN_INCLUDE_DIR}/llvm/Config/config.h.cmake
739    ${LLVM_INCLUDE_DIR}/llvm/Config/config.h)

The config-ix.cmake file uses the built-in CMake function check_symbol_exists in order to determine which system calls are available in the target build environment. For example, it checks whether pread is available and, if it is, has CMake define a variable named HAVE_PREAD:

llvm/cmake/config-ix.cmake

205  check_symbol_exists(pread unistd.h HAVE_PREAD)

Then, in HandleLLVMOptions.cmake, it uses the built-in CMake platform variables, WIN32 and UNIX, to set the CMake variables LLVM_ON_WIN32 and LLVM_ON_UNIX to 1 or 0:

llvm/cmake/modules/HandleLLVMOptions.cmake

108  if(WIN32)
...  
114      set(LLVM_ON_WIN32 1)
115      set(LLVM_ON_UNIX 0)
...  
117  else(WIN32)
118    if(UNIX)
119      set(LLVM_ON_WIN32 0)
120      set(LLVM_ON_UNIX 1)
...  
129  endif(WIN32)

At this point, CMake variables like HAVE_PREAD and LLVM_ON_UNIX would only be visible from within CMake. To make their values visible in C++, the config.h.cmake file is configured via a call to the CMake built-in function configure_file, as shown in a code snippet above. The config.h.cmake file is full of #cmakedefine directives, which configure_file transforms into #define statements for consumption in C++. For example, config.h.cmake contains these #cmakedefine statements…

llvm/include/llvm/Config/config.h.cmake

142  /* Define to 1 if you have the `pread' function. */
143  #cmakedefine HAVE_PREAD ${HAVE_PREAD}
...  
311  /* Define if this is Unixish platform */
312  #cmakedefine LLVM_ON_UNIX ${LLVM_ON_UNIX}
313  
314  /* Define if this is Win32ish platform */
315  #cmakedefine LLVM_ON_WIN32 ${LLVM_ON_WIN32}

…which on a Unix-like platform, such as macOS, are transformed into these statements, placed in a file in the build directory named include/llvm/Config/config.h:

build/include/llvm/Config/config.h

142  /* Define to 1 if you have the `pread' function. */
143  #define HAVE_PREAD 1
...  
311  /* Define if this is Unixish platform */
312  #define LLVM_ON_UNIX 1

And in llvm/lib/Support/Path.cpp, instead of finding an implementation of the llvm::sys::fs::openFileForRead function, I find a conditional include based on these definitions:

llvm/lib/Support/Path.cpp

1072  // Include the truly platform-specific parts.
1073  #if defined(LLVM_ON_UNIX)
1074  #include "Unix/Path.inc"
1075  #endif
1076  #if defined(LLVM_ON_WIN32)
1077  #include "Windows/Path.inc"
1078  #endif

It's in the included llvm/lib/Support/Unix/Path.inc file that I can find the actual implementation of llvm::sys::fs::openFileForRead that's used on Unix platforms.
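
The overall pattern, one declaration with a definition chosen per platform by the preprocessor, can be sketched in a single file. This is only an illustration: joinPath is a made-up function, the two definitions are written inline instead of living in separate .inc files, and the compiler-provided _WIN32 macro stands in for the LLVM_ON_WIN32 definition that CMake generates.

```cpp
#include <cassert>
#include <string>

// One declaration, visible to callers on every platform.
std::string joinPath(const std::string &Dir, const std::string &File);

// One definition per platform. Only one of these survives preprocessing.
#if defined(_WIN32)

std::string joinPath(const std::string &Dir, const std::string &File) {
  assert(!File.empty() && "expected a file name");
  if (Dir.empty() || Dir.back() == '\\')
    return Dir + File;
  return Dir + '\\' + File;
}

#else

std::string joinPath(const std::string &Dir, const std::string &File) {
  assert(!File.empty() && "expected a file name");
  if (Dir.empty() || Dir.back() == '/')
    return Dir + File;
  return Dir + '/' + File;
}

#endif
```

Callers write joinPath("lib", "Support") and never mention the platform; the choice was already made when the file was compiled.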

Opening a file on Unix

As in the read.cpp example at the beginning of this article, the Unix implementation of the llvm::sys::fs::openFileForRead function uses the system call open(2) in order to open a file and get its file descriptor:

llvm/lib/Support/Unix/Path.inc

719  std::error_code openFileForRead(const Twine &Name, int &ResultFD,
720                                  SmallVectorImpl<char> *RealPath) {
721    SmallString<128> Storage;
722    StringRef P = Name.toNullTerminatedStringRef(Storage);
723    int OpenFlags = O_RDONLY;
724  #ifdef O_CLOEXEC
725    OpenFlags |= O_CLOEXEC;
726  #endif
727    if ((ResultFD = sys::RetryAfterSignal(-1, open, P.begin(), OpenFlags)) < 0)
728      return std::error_code(errno, std::generic_category());
729  #ifndef O_CLOEXEC
730    int r = fcntl(ResultFD, F_SETFD, FD_CLOEXEC);
731    (void)r;
732    assert(r == 0 && "fcntl(F_SETFD, FD_CLOEXEC) failed");
733  #endif
...  
758    return std::error_code();
759  }

The implementation above is long-winded because of two pieces of Unix trivia.

First off, instead of calling open(2) directly, it calls llvm::sys::RetryAfterSignal, which invokes open(2) in a while loop. This loop retries the open(2) call if it fails with an EINTR error code:

llvm/include/llvm/Support/Errno.h

33  template <typename FailT, typename Fun, typename... Args>
34  inline auto RetryAfterSignal(const FailT &Fail, const Fun &F,
35                               const Args &... As) -> decltype(F(As...)) {
36    decltype(F(As...)) Res;
37    do
38      Res = F(As...);
39    while (Res == Fail && errno == EINTR);
40    return Res;
41  }

I'm not a C++ expert. In case you aren't either, allow me to offer an explanation for the templates being used in the code above.

The RetryAfterSignal function has three template parameters, each of which determines the type of one of its function parameters:

FailT is the type of const FailT &Fail, the value that the called function returns when it fails.

Fun is the type of const Fun &F, the function to call.

Args is a template parameter pack. It provides the types of const Args &... As, the arguments passed to function F.

RetryAfterSignal uses the trailing return type syntax, of the form auto function -> return_type. Its return type is specified as decltype(F(As...)). In other words, the return type is the type returned by the expression F(As...).

To map this all to the concrete example we were looking at in llvm::sys::fs::openFileForRead, recall that function had the expression sys::RetryAfterSignal(-1, open, P.begin(), OpenFlags). Here -1 is the failure value const FailT &Fail, open is the function value const Fun &F, and P.begin() and OpenFlags make up the parameter pack const Args &... As, the arguments passed on to the open function. The return type is the type returned by open(P.begin(), OpenFlags), which is int.

The llvm::sys::RetryAfterSignal function ignores the EINTR and retries because "blocking" Unix functions like open(2) and read(2) return EINTR whenever they are interrupted by a Unix signal. Interruptions like this can occur for all sorts of reasons, some of which you can read more about here. In these cases, LLVM will simply try again.
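
To watch the retry loop at work without waiting for a real signal, I can feed it a fake "system call". In the sketch below, retryAfterSignal is a copy of the LLVM template shown earlier under a lowercase name, while flakyCall, missingFile, and AttemptCount are hypothetical stand-ins invented for the demonstration.

```cpp
#include <cassert>
#include <cerrno>

// Same shape as LLVM's helper, reproduced so that the example is standalone.
template <typename FailT, typename Fun, typename... Args>
auto retryAfterSignal(const FailT &Fail, const Fun &F, const Args &...As)
    -> decltype(F(As...)) {
  decltype(F(As...)) Res;
  do
    Res = F(As...);
  while (Res == Fail && errno == EINTR);
  return Res;
}

// Counts how many times flakyCall has been invoked.
static int AttemptCount = 0;

// Hypothetical stand-in for a blocking system call: the first two attempts
// report that they were interrupted, and the third one returns Value.
int flakyCall(int Value) {
  assert(Value != -1 && "-1 is reserved as the failure value");
  ++AttemptCount;
  if (AttemptCount <= 2) {
    errno = EINTR;
    return -1;
  }
  return Value;
}

// Hypothetical stand-in for a call that fails for a reason other than a
// signal. The loop must give up immediately in this case.
int missingFile() {
  errno = ENOENT;
  return -1;
}
```

Calling retryAfterSignal(-1, flakyCall, 42) invokes flakyCall three times before returning 42, whereas retryAfterSignal(-1, missingFile) returns -1 after a single attempt, because ENOENT is not EINTR.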

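The other quirk in the llvm::sys::fs::openFileForRead implementation is the check for O_CLOEXEC, an open(2) flag that isn't available on every platform (on Linux, for example, it only exists in version 2.6.23 and above). This flag has the OS automatically close the file descriptor when the process calls one of the exec functions to run a new program, so the descriptor doesn't leak into that program. If the flag isn't available, the implementation uses the syscall fcntl in order to set the equivalent FD_CLOEXEC flag after the file has been opened.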
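
The close-on-exec flag can be inspected after the fact with fcntl and F_GETFD, which makes for a small self-checking experiment. This is a sketch for Unix systems, and the names openCloseOnExec and isCloseOnExec are my own; neither appears in LLVM.

```cpp
#include <cassert>
#include <cerrno>

#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: open Path read-only and make sure the descriptor has
// the close-on-exec flag set. Returns -1, with errno set, on failure.
int openCloseOnExec(const char *Path) {
  int Flags = O_RDONLY;
#ifdef O_CLOEXEC
  Flags |= O_CLOEXEC; // Set atomically, as part of open(2) itself.
#endif
  int FD = open(Path, Flags);
  if (FD < 0)
    return -1;
#ifndef O_CLOEXEC
  // Fallback: set the equivalent descriptor flag after opening.
  if (fcntl(FD, F_SETFD, FD_CLOEXEC) < 0) {
    int Saved = errno;
    close(FD);
    errno = Saved;
    return -1;
  }
#endif
  return FD;
}

// Reports whether FD currently has the close-on-exec flag set.
bool isCloseOnExec(int FD) {
  int DescriptorFlags = fcntl(FD, F_GETFD);
  assert(DescriptorFlags != -1 && "FD must be an open descriptor");
  return (DescriptorFlags & FD_CLOEXEC) != 0;
}
```

A descriptor from openCloseOnExec reports the flag as set, while one from a plain open(Path, O_RDONLY) does not, which is exactly the difference the #ifdef blocks in openFileForRead are there to guarantee.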

Reading the file into an `llvm::WritableMemoryBuffer`

The llvm::sys::fs::openFileForRead function opens a file and returns its file descriptor. Then control is returned back to the getFileAux function, which passes the open descriptor into the getOpenFileImpl static function:

llvm/lib/Support/MemoryBuffer.cpp

242  template <typename MB>
243  static ErrorOr<std::unique_ptr<MB>>
244  getFileAux(const Twine &Filename, int64_t FileSize, uint64_t MapSize,
245             uint64_t Offset, bool RequiresNullTerminator, bool IsVolatile) {
246    int FD;
247    std::error_code EC = sys::fs::openFileForRead(Filename, FD);
248  
249    if (EC)
250      return EC;
251  
252    auto Ret = getOpenFileImpl<MB>(FD, Filename, FileSize, MapSize, Offset,
253                                   RequiresNullTerminator, IsVolatile);
254    close(FD);
255    return Ret;
256  }

The getOpenFileImpl function implements the same logic as the read.cpp example at the beginning of this article. If the file's size was not provided, it finds out how large the file is by calling llvm::sys::fs::status, which on Unix calls fstat. It then makes a decision as to whether to use mmap(2) or to allocate memory up front using operator new. If it allocates memory, then it uses the system call read(2) (or pread, if HAVE_PREAD is defined) in order to read the bytes of the file into memory:

llvm/lib/Support/MemoryBuffer.cpp

416  template <typename MB>
417  static ErrorOr<std::unique_ptr<MB>>
418  getOpenFileImpl(int FD, const Twine &Filename, uint64_t FileSize,
419                  uint64_t MapSize, int64_t Offset, bool RequiresNullTerminator,
420                  bool IsVolatile) {
421    static int PageSize = sys::Process::getPageSize();
422  
423    // Default is to map the full file.
424    if (MapSize == uint64_t(-1)) {
425      // If we don't know the file size, use fstat to find out.  fstat on an open
426      // file descriptor is cheaper than stat on a random path.
427      if (FileSize == uint64_t(-1)) {
428        sys::fs::file_status Status;
429        std::error_code EC = sys::fs::status(FD, Status);
430        if (EC)
431          return EC;
...  
441        FileSize = Status.getSize();
442      }
443      MapSize = FileSize;
444    }
445  
446    if (shouldUseMmap(FD, FileSize, MapSize, Offset, RequiresNullTerminator,
447                      PageSize, IsVolatile)) {
448      std::error_code EC;
449      std::unique_ptr<MB> Result(
450          new (NamedBufferAlloc(Filename)) MemoryBufferMMapFile<MB>(
451              RequiresNullTerminator, FD, MapSize, Offset, EC));
452      if (!EC)
453        return std::move(Result);
454    }
455  
456    auto Buf = WritableMemoryBuffer::getNewUninitMemBuffer(MapSize, Filename);
457    if (!Buf) {
458      // Failed to create a buffer. The only way it can fail is if
459      // new(std::nothrow) returns 0.
460      return make_error_code(errc::not_enough_memory);
461    }
462  
463    char *BufPtr = Buf.get()->getBufferStart();
464  
465    size_t BytesLeft = MapSize;
466  #ifndef HAVE_PREAD
467    if (lseek(FD, Offset, SEEK_SET) == -1)
468      return std::error_code(errno, std::generic_category());
469  #endif
470  
471    while (BytesLeft) {
472  #ifdef HAVE_PREAD
473      ssize_t NumRead = sys::RetryAfterSignal(-1, ::pread, FD, BufPtr, BytesLeft,
474                                              MapSize - BytesLeft + Offset);
475  #else
476      ssize_t NumRead = sys::RetryAfterSignal(-1, ::read, FD, BufPtr, BytesLeft);
477  #endif
478      if (NumRead == -1) {
479        // Error while reading.
480        return std::error_code(errno, std::generic_category());
481      }
482      if (NumRead == 0) {
483        memset(BufPtr, 0, BytesLeft); // zero-initialize rest of the buffer.
484        break;
485      }
486      BytesLeft -= NumRead;
487      BufPtr += NumRead;
488    }
489  
490    return std::move(Buf);
491  }

The functions llvm::sys::Process::getPageSize and llvm::sys::fs::status above use the same CMake tricks as llvm::sys::fs::openFileForRead did in order to include a platform-specific implementation: getPageSize is implemented in llvm/lib/Support/Unix/Process.inc and Windows/Process.inc, and status is implemented in Unix/Path.inc and Windows/Path.inc. On Unix they use system calls getpagesize and fstat in order to get the information they need from the operating system.
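
The two #ifdef HAVE_PREAD branches in getOpenFileImpl differ in who keeps track of the position in the file: pread takes the offset as an argument and leaves the descriptor's own file position alone, while read(2) always continues from that position, which is why the fallback has to call lseek first. Below is a sketch of the same idea in isolation. The function name readAt is hypothetical, and HAVE_PREAD is assumed to be supplied by the build system, as it is in LLVM; if it isn't defined, the sketch compiles the lseek and read(2) path.

```cpp
#include <cassert>
#include <cerrno>

#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: copy up to Size bytes, starting at byte Offset of the
// file behind FD, into Buf. Returns the number of bytes copied, which is
// smaller than Size if the file ends early, or -1 on error.
ssize_t readAt(int FD, char *Buf, size_t Size, off_t Offset) {
  assert(Buf != nullptr && Offset >= 0);
#ifndef HAVE_PREAD
  // read(2) starts at the descriptor's current position, so move it first.
  if (lseek(FD, Offset, SEEK_SET) == -1)
    return -1;
#endif

  size_t Done = 0;
  while (Done < Size) {
#ifdef HAVE_PREAD
    ssize_t N = pread(FD, Buf + Done, Size - Done,
                      Offset + static_cast<off_t>(Done));
#else
    ssize_t N = read(FD, Buf + Done, Size - Done);
#endif
    if (N < 0) {
      if (errno == EINTR)
        continue; // Interrupted by a signal: try again.
      return -1;
    }
    if (N == 0)
      break; // End of file.
    Done += static_cast<size_t>(N);
  }
  return static_cast<ssize_t>(Done);
}
```

For a file containing the ten characters 0123456789, readAt(FD, Buf, 4, 3) copies "3456" and returns 4, regardless of which branch was compiled in.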

The code above instantiates either an llvm::MemoryBufferMMapFile or an llvm::WritableMemoryBuffer based on whether the helper function shouldUseMmap returns true or false. As in the mmap(2) version of the read.cpp example, one criterion for that decision is the size of the file – for example, if it's smaller than a page on the system, or smaller than 16 kilobytes, then mmap(2) is not used:

llvm/lib/Support/MemoryBuffer.cpp

308  static bool shouldUseMmap(int FD,
309                            size_t FileSize,
310                            size_t MapSize,
311                            off_t Offset,
312                            bool RequiresNullTerminator,
313                            int PageSize,
314                            bool IsVolatile) {
...  
321    // We don't use mmap for small files because this can severely fragment our
322    // address space.
323    if (MapSize < 4 * 4096 || MapSize < (unsigned)PageSize)
324      return false;
...  
360    return true;
361  }
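
To make the arithmetic in that size check concrete, here it is pulled out into a function of its own. This sketch covers only the comparison on line 323; the real shouldUseMmap performs further checks in the lines elided above, and bigEnoughToMap is a name I made up.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper mirroring the quoted size check: a region is only a
// candidate for mmap(2) if it is at least 16 KiB and at least one page long.
bool bigEnoughToMap(std::size_t MapSize, std::size_t PageSize) {
  assert(PageSize > 0 && "page size must be positive");
  const std::size_t MinMapSize = 4 * 4096; // 16384 bytes
  return MapSize >= MinMapSize && MapSize >= PageSize;
}
```

With 4096-byte pages, the 2820-byte read.cpp is too small (bigEnoughToMap(2820, 4096) is false), and a 16384-byte file is the smallest that qualifies. On a system with 65536-byte pages, that same 16384-byte file would be rejected because it doesn't fill a single page.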

Assuming mmap(2) is not used, then the getOpenFileImpl function calls the static function llvm::WritableMemoryBuffer::getNewUninitMemBuffer. This function allocates the buffer memory just as the read.cpp example did, by using operator new. Unlike the read.cpp example program, however, this function not only allocates memory for a buffer to store the file's contents, it also allocates space for an instance of the llvm::MemoryBuffer class, and for the name of the file:

llvm/lib/Support/MemoryBuffer.cpp

273  std::unique_ptr<WritableMemoryBuffer>
274  WritableMemoryBuffer::getNewUninitMemBuffer(size_t Size, const Twine &BufferName) {
275    using MemBuffer = MemoryBufferMem<WritableMemoryBuffer>;
276    // Allocate space for the MemoryBuffer, the data and the name. It is important
277    // that MemoryBuffer and data are aligned so PointerIntPair works with them.
...  
280    SmallString<256> NameBuf;
281    StringRef NameRef = BufferName.toStringRef(NameBuf);
282    size_t AlignedStringLen = alignTo(sizeof(MemBuffer) + NameRef.size() + 1, 16);
283    size_t RealLen = AlignedStringLen + Size + 1;
284    char *Mem = static_cast<char*>(operator new(RealLen, std::nothrow));
285    if (!Mem)
286      return nullptr;
287  
288    // The name is stored after the class itself.
289    CopyStringRef(Mem + sizeof(MemBuffer), NameRef);
290  
291    // The buffer begins after the name and must be aligned.
292    char *Buf = Mem + AlignedStringLen;
293    Buf[Size] = 0; // Null terminate buffer.
294  
295    auto *Ret = new (Mem) MemBuffer(StringRef(Buf, Size), true);
296    return std::unique_ptr<WritableMemoryBuffer>(Ret);
297  }

Based on the code above, I can see that the memory that's being allocated here is laid out in three distinct segments:

The memory for llvm::MemoryBuffer is laid out in three segments.

The first segment of memory allocated is sized such that an instance of llvm::MemoryBufferMem<llvm::WritableMemoryBuffer> could fit within it. Note that the size is calculated using sizeof(MemBuffer), and then the memory buffer is instantiated by calling new (Mem) MemBuffer(...). As I mentioned in my article on Getting Started with the Swift Frontend: Lexing & Parsing, this is a "placement" new operator call. It doesn't allocate any memory, and instead calls the MemBuffer constructor, and then places the constructed instance in the chunk of memory Mem. (You can read more about "placement new" here.)
The second segment of memory stores the name of the buffer. It's sized using the call to NameRef.size() above, and then the name is copied by calling the static helper function CopyStringRef.
Finally comes the rest of the buffer, which is the same size as the file being read into it.
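
The same three-segment layout can be reproduced with a much simpler header type. In the sketch below, Header, alignUp, allocateBuffer, and freeBuffer are all hypothetical names; Header is a plain struct standing in for the MemoryBuffer object, and the 16-byte alignment is taken from the alignTo call in the LLVM code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <new>

// Hypothetical header type standing in for the MemoryBuffer object.
struct Header {
  const char *Name;
  char *Data;
  std::size_t Size;
};

// Round Value up to the next multiple of Align, which must be a power of two.
std::size_t alignUp(std::size_t Value, std::size_t Align) {
  assert(Align != 0 && (Align & (Align - 1)) == 0);
  return (Value + Align - 1) & ~(Align - 1);
}

// One allocation, three segments:
//   [Header][name + '\0' + padding][Size bytes of data + '\0']
Header *allocateBuffer(const char *Name, std::size_t Size) {
  std::size_t NameLen = std::strlen(Name);
  std::size_t DataOffset = alignUp(sizeof(Header) + NameLen + 1, 16);
  char *Mem = static_cast<char *>(
      operator new(DataOffset + Size + 1, std::nothrow));
  if (!Mem)
    return nullptr;

  // Second segment: the name is stored right after the header.
  char *NameStart = Mem + sizeof(Header);
  std::memcpy(NameStart, Name, NameLen + 1);

  // Third segment: the data begins at the aligned offset and is
  // null terminated. Its contents are otherwise uninitialized.
  char *Data = Mem + DataOffset;
  Data[Size] = '\0';

  // First segment: placement new constructs the header at the front of the
  // block without allocating anything.
  return new (Mem) Header{NameStart, Data, Size};
}

// Release a block produced by allocateBuffer.
void freeBuffer(Header *H) {
  if (!H)
    return;
  H->~Header();
  operator delete(H);
}
```

A single call to freeBuffer releases the header, the name, and the data together, since all three live in the one block returned by operator new.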

The memory buffer allocated and returned by the llvm::WritableMemoryBuffer::getNewUninitMemBuffer function is an llvm::MemoryBufferMem<llvm::WritableMemoryBuffer>. MemoryBufferMem<T> is defined as a subclass of T. In this case, T is an llvm::WritableMemoryBuffer, which in turn derives from llvm::MemoryBuffer. The constructor of MemoryBufferMem calls through to llvm::MemoryBuffer::init:

llvm/lib/Support/MemoryBuffer.cpp

 83  /// MemoryBufferMem - Named MemoryBuffer pointing to a block of memory.
 84  template<typename MB>
 85  class MemoryBufferMem : public MB {
 86  public:
 87    MemoryBufferMem(StringRef InputData, bool RequiresNullTerminator) {
 88      MemoryBuffer::init(InputData.begin(), InputData.end(),
 89                         RequiresNullTerminator);
 90    }
 91  
 92    /// Disable sized deallocation for MemoryBufferMem, because it has
 93    /// tail-allocated data.
 94    void operator delete(void *p) { ::operator delete(p); }
...  
104  };

And the llvm::MemoryBuffer::init function simply sets private members pointing to the beginning and end of the buffer:

llvm/include/llvm/Support/MemoryBuffer.h

  42  class MemoryBuffer {
  43    const char *BufferStart; // Start of the buffer.
  44    const char *BufferEnd;   // End of the buffer.
  ..
 154  };

llvm/lib/Support/MemoryBuffer.cpp

 44  /// init - Initialize this MemoryBuffer as a reference to externally allocated
 45  /// memory, memory that we know is already null terminated.
 46  void MemoryBuffer::init(const char *BufStart, const char *BufEnd,
 47                          bool RequiresNullTerminator) {
 48    assert((!RequiresNullTerminator || BufEnd[0] == 0) &&
 49           "Buffer is not null terminated!");
 50    BufferStart = BufStart;
 51    BufferEnd = BufEnd;
 52  }
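
The relationship between these classes can be sketched in a few lines. This is a simplified illustration with hypothetical names (ReadOnlyBuffer, WritableBuffer, InMemory), not LLVM's actual declarations. It shows two things: the end pointer is one past the last content byte, so the null terminator is checked at BufEnd[0] and is not counted in the size; and a class template that derives from its own parameter lets one implementation serve both the read-only and the writable base class.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for llvm::MemoryBuffer: a non-owning [start, end) view.
class ReadOnlyBuffer {
  const char *BufferStart = nullptr;
  const char *BufferEnd = nullptr;

protected:
  void init(const char *BufStart, const char *BufEnd,
            bool RequiresNullTerminator) {
    // BufEnd points one past the last content byte, so the terminator is
    // read from BufEnd[0] and is not included in size().
    assert((!RequiresNullTerminator || BufEnd[0] == 0) &&
           "Buffer is not null terminated!");
    BufferStart = BufStart;
    BufferEnd = BufEnd;
  }

public:
  const char *begin() const { return BufferStart; }
  const char *end() const { return BufferEnd; }
  std::size_t size() const {
    return static_cast<std::size_t>(BufferEnd - BufferStart);
  }
};

// Hypothetical stand-in for llvm::WritableMemoryBuffer. The cast is only
// valid when the underlying memory really is writable.
class WritableBuffer : public ReadOnlyBuffer {
public:
  char *writableBegin() { return const_cast<char *>(begin()); }
};

// Like MemoryBufferMem<MB>: derives from whichever buffer type it is given
// and forwards its arguments to the shared init function.
template <typename MB> class InMemory : public MB {
public:
  InMemory(const char *Start, const char *End) {
    ReadOnlyBuffer::init(Start, End, /*RequiresNullTerminator=*/true);
  }
};
```

Given char Storage[] = "abc", both InMemory<ReadOnlyBuffer>(Storage, Storage + 3) and InMemory<WritableBuffer>(Storage, Storage + 3) report a size of 3, but only the second exposes a mutable pointer.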

In summary, on a Unix system:

The llvm::MemoryBuffer::getFileOrSTDIN static function checks whether it's been given a filename of "-" and, if it has, calls llvm::MemoryBuffer::getSTDIN. Otherwise, it calls llvm::MemoryBuffer::getFile.
llvm::MemoryBuffer::getFile calls through to getFileAux.
getFileAux gets an open file descriptor by calling llvm::sys::fs::openFileForRead, then calls getOpenFileImpl to instantiate a new llvm::MemoryBuffer and read in the contents of the file, and finally calls close(2) to close the file descriptor.
getOpenFileImpl checks the file size and determines whether to use mmap(2). If mmap(2) is not used, then getOpenFileImpl allocates memory for an llvm::MemoryBuffer, its name, and its contents. It then reads in the contents of the file using read(2) or pread, depending on what's available on the operating system.
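
The non-mmap path summarized above can be approximated with plain POSIX calls. The following is a rough sketch, not LLVM's code: the function names readAll and readFileOrStdin are made up for illustration, the contents go into a growing std::string rather than a pre-sized buffer, and the mmap(2) decision is left out entirely.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>
#include <system_error>

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Reads from an open file descriptor until end of file, appending to Out.
std::error_code readAll(int FD, std::string &Out) {
  char Chunk[4096];
  for (;;) {
    ssize_t N = ::read(FD, Chunk, sizeof(Chunk));
    if (N == 0)
      return std::error_code(); // End of file.
    if (N < 0) {
      if (errno == EINTR)
        continue; // Interrupted by a signal; try again.
      return std::error_code(errno, std::generic_category());
    }
    Out.append(Chunk, static_cast<std::size_t>(N));
  }
}

// A filename of "-" means standard input; anything else is opened as a file.
std::error_code readFileOrStdin(const std::string &Filename,
                                std::string &Out) {
  if (Filename == "-")
    return readAll(STDIN_FILENO, Out);

  int FD = ::open(Filename.c_str(), O_RDONLY);
  if (FD < 0)
    return std::error_code(errno, std::generic_category());

  // Use the file size as a hint so the string grows at most once.
  struct stat Status;
  if (::fstat(FD, &Status) == 0 && Status.st_size > 0)
    Out.reserve(static_cast<std::size_t>(Status.st_size));

  std::error_code EC = readAll(FD, Out);
  ::close(FD);
  return EC;
}
```

As in LLVM, errors are reported through std::error_code instead of exceptions, so a caller can check the result of readFileOrStdin("read.cpp", Contents) with a plain if statement.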

Mapping the file into an `llvm::MemoryBufferMMapFile`

Recall that getOpenFileImpl instantiates an llvm::MemoryBufferMMapFile if shouldUseMmap returns true:

llvm/lib/Support/MemoryBuffer.cpp

416  template <typename MB>
417  static ErrorOr<std::unique_ptr<MB>>
418  getOpenFileImpl(int FD, const Twine &Filename, uint64_t FileSize,
419                  uint64_t MapSize, int64_t Offset, bool RequiresNullTerminator,
420                  bool IsVolatile) {
...  
446    if (shouldUseMmap(FD, FileSize, MapSize, Offset, RequiresNullTerminator,
447                      PageSize, IsVolatile)) {
448      std::error_code EC;
449      std::unique_ptr<MB> Result(
450          new (NamedBufferAlloc(Filename)) MemoryBufferMMapFile<MB>(
451              RequiresNullTerminator, FD, MapSize, Offset, EC));
452      if (!EC)
453        return std::move(Result);
454    }
455  
456    auto Buf = WritableMemoryBuffer::getNewUninitMemBuffer(MapSize, Filename);
...  
490    return std::move(Buf);
491  }

The llvm::MemoryBufferMMapFile class makes use of the llvm::sys::fs::mapped_file_region class, a wrapper around the mmap(2) and munmap system calls:

llvm/lib/Support/MemoryBuffer.cpp

166  /// \brief Memory maps a file descriptor using sys::fs::mapped_file_region.
167  ///
168  /// This handles converting the offset into a legal offset on the platform.
169  template<typename MB>
170  class MemoryBufferMMapFile : public MB {
171    sys::fs::mapped_file_region MFR;
...  
185  public:
186    MemoryBufferMMapFile(bool RequiresNullTerminator, int FD, uint64_t Len,
187                         uint64_t Offset, std::error_code &EC)
188        : MFR(FD, MB::Mapmode, getLegalMapSize(Len, Offset),
189              getLegalMapOffset(Offset), EC) {
190      if (!EC) {
191        const char *Start = getStart(Len, Offset);
192        MemoryBuffer::init(Start, Start + Len, RequiresNullTerminator);
193      }
194    }
...  
208  };

The mapped_file_region constructor calls mapped_file_region::init, which calls mmap(2). Its destructor calls munmap:

llvm/lib/Support/Unix/Path.inc

597  std::error_code mapped_file_region::init(int FD, uint64_t Offset,
598                                           mapmode Mode) {
...  
623    Mapping = ::mmap(nullptr, Size, prot, flags, FD, Offset);
624    if (Mapping == MAP_FAILED)
625      return std::error_code(errno, std::generic_category());
626    return std::error_code();
627  }
628  
629  mapped_file_region::mapped_file_region(int fd, mapmode mode, size_t length,
630                                         uint64_t offset, std::error_code &ec)
631      : Size(length), Mapping(), FD(fd), Mode(mode) {
...  
634    ec = init(fd, offset, mode);
635    if (ec)
636      Mapping = nullptr;
637  }
638  
639  mapped_file_region::~mapped_file_region() {
640    if (Mapping)
641      ::munmap(Mapping, Size);
642  }
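
The pattern here is ordinary RAII: map in the constructor, unmap in the destructor, and report failure through an std::error_code out-parameter. A stripped-down sketch of that pattern follows. ReadOnlyMapping is a hypothetical class, and unlike mapped_file_region it only supports read-only mappings that start at offset 0.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <system_error>

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical, simplified stand-in for mapped_file_region.
class ReadOnlyMapping {
  void *Mapping = nullptr;
  std::size_t Size = 0;

public:
  ReadOnlyMapping(int FD, std::size_t Length, std::error_code &EC)
      : Size(Length) {
    void *Result = ::mmap(nullptr, Size, PROT_READ, MAP_PRIVATE, FD, 0);
    if (Result == MAP_FAILED) {
      // Leave Mapping null so the destructor has nothing to unmap.
      EC = std::error_code(errno, std::generic_category());
      return;
    }
    Mapping = Result;
    EC = std::error_code();
  }

  ~ReadOnlyMapping() {
    if (Mapping)
      ::munmap(Mapping, Size);
  }

  // A mapping has exactly one owner, so copying is disabled.
  ReadOnlyMapping(const ReadOnlyMapping &) = delete;
  ReadOnlyMapping &operator=(const ReadOnlyMapping &) = delete;

  const char *data() const { return static_cast<const char *>(Mapping); }
  std::size_t size() const { return Size; }
};
```

Because the failure case leaves the pointer null, a caller can construct the object, check the error code, and fall back to another strategy if mapping failed, which is the shape of the if (!EC) check in getOpenFileImpl.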

What I learned

Looking into llvm::MemoryBuffer and how LLVM reads source files into memory taught me a lot:

At build time LLVM's CMake code determines which platform it's being built for. Based on this, it includes Unix- or Windows-specific implementations, such as llvm/lib/Support/Unix/Path.inc or Windows/Path.inc.
Also at build time, LLVM's CMake code determines which system calls are available on the target platform. For example, if pread is available, then getOpenFileImpl will use pread to read the file into an llvm::WritableMemoryBuffer, instead of read(2).
I can use mmap(2) to access the contents of a very large file without allocating a large amount of memory. LLVM's shouldUseMmap function references the file size, among other characteristics, to determine whether to use pre-allocated memory with llvm::WritableMemoryBuffer, or mmap(2) with llvm::MemoryBufferMMapFile.
llvm::MemoryBuffer maintains a buffer for the contents of a source file as a "trailing object" – a block of memory that is allocated when the class is constructed, but is not a member of the class itself. LLVM uses this trailing object pattern extensively. (It even defines an llvm::TrailingObjects class template, which I plan on writing more about in the future.)
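
As a small illustration of the second point, build-time feature detection usually surfaces in the code as a preprocessor check. The sketch below assumes the build system defines a macro named HAVE_PREAD when pread is available; the function names readAt and readFully are hypothetical and this is not LLVM's implementation.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Reads up to Size bytes starting at Offset. Returns the number of bytes
// read, 0 at end of file, or -1 on error.
ssize_t readAt(int FD, char *Buf, std::size_t Size, off_t Offset) {
#ifdef HAVE_PREAD
  // Assumed to be defined by the build configuration when pread exists.
  return ::pread(FD, Buf, Size, Offset);
#else
  // Fallback: move the file position, then issue a plain read(2).
  if (::lseek(FD, Offset, SEEK_SET) == -1)
    return -1;
  return ::read(FD, Buf, Size);
#endif
}

// Keeps reading until exactly Size bytes have arrived. Returns false on an
// error or if the file ends early.
bool readFully(int FD, char *Buf, std::size_t Size, off_t Offset) {
  while (Size > 0) {
    ssize_t N = readAt(FD, Buf, Size, Offset);
    if (N < 0) {
      if (errno == EINTR)
        continue; // Interrupted by a signal; try again.
      return false;
    }
    if (N == 0)
      return false; // File is shorter than expected.
    Buf += N;
    Size -= static_cast<std::size_t>(N);
    Offset += N;
  }
  return true;
}
```

Either branch gives callers the same interface, so the rest of the program doesn't need to know which system call was available when it was built.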
