How Swift and Clang Use LLVM to Read Files into Memory
The prior article in this series explained how the Swift and Clang compilers use llvm::SourceMgr to emit diagnostics for source locations in memory buffers, represented by the class llvm::MemoryBuffer. This article focuses on llvm::MemoryBuffer, the primary abstraction for reading files and streams into memory. Since it's used by Swift, Clang, and LLVM tools like llvm-tblgen, I found it valuable to understand how it works.
Reading a file into memory using C++
The documentation for libLLVMSupport's llvm::MemoryBuffer class says it "provides simple read-only access to a block of memory, and provides simple methods for reading files and standard input into a memory buffer." To better understand how it does that, I tried writing a simple C++ program, called read.cpp, that reads a file (itself, in this case) into memory. For simplicity's sake my program is only meant to operate on Unix systems.
My read.cpp program reads a file into memory by using various system calls. These are requests made to the operating system for things like "open a file and give me its file descriptor," or "read 8 bytes from the file with this file descriptor." Julia Evans has a wonderful comic that explains them further:
My read.cpp program uses four system calls:
- open(2), to get a file descriptor for the file.
- fstat, which returns information about a file descriptor. Specifically, read.cpp allocates memory based on the file's size.
- read(2), which reads a given number of bytes from a file into a pre-allocated block of memory.
- close(2), to close a file descriptor once I'm done using it.
Once the read.cpp program allocates memory and reads its own source file into that memory, it walks a char * pointer through the memory, one byte at a time, and prints out the first line of the file:
read.cpp
#include <cerrno>
#include <iostream>
#include <new>
#include <system_error>

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
  // I'll open this file itself and read it into memory.
  auto FileName = __FILE__;

  // The system call open(2) gets a file descriptor
  // representing the open file.
  int OpenFlags = O_RDONLY;
  int FD = open(FileName, OpenFlags);

  // open(2) returns a -1 if the file could not be opened.
  // In this case, print an error and return.
  if (FD < 0) {
    std::error_code Err(errno, std::generic_category());
    std::cerr << "[ERROR] Could not open file \""
              << FileName << "\": " << Err.message()
              << std::endl;
    return 1;
  }

  // Syscall fstat populates the struct stat pointer
  // with information about the given file descriptor,
  // including the file's size in bytes.
  struct stat Stat;
  if (fstat(FD, &Stat) < 0) {
    std::error_code Err(errno, std::generic_category());
    std::cerr << "[ERROR] Could not acquire information "
              << "on file descriptor \"" << FD
              << "\": " << Err.message() << std::endl;
    return 1;
  }

  off_t FileSize = Stat.st_size;
  std::cout << "[NOTE] File size: " << FileSize << " bytes"
            << std::endl;

  // Allocate memory in size equal to the number of bytes
  // in the file, plus one byte for a null terminator.
  char *Memory = static_cast<char *>(operator new(
      FileSize + 1, std::nothrow));
  // With std::nothrow, a failed allocation returns null
  // instead of throwing, so check before writing to it.
  if (!Memory) {
    std::cerr << "[ERROR] Could not allocate memory"
              << std::endl;
    return 1;
  }
  Memory[FileSize] = 0;

  // Use syscall read(2) to read in bytes from the given
  // file descriptor, into the prepared buffer, 16 bytes
  // at a time.
  const ssize_t ChunkSize = 16;
  ssize_t Offset = 0;
  ssize_t ReadBytes = 0;
  do {
    ReadBytes = read(FD, Memory + Offset, ChunkSize);
    if (ReadBytes < 0) {
      std::error_code Err(errno, std::generic_category());
      std::cerr << "[ERROR] Could not read from file "
                   "descriptor \""
                << FD << "\": " << Err.message()
                << std::endl;
      // The memory came from operator new, so it is
      // released with the matching operator delete.
      operator delete(Memory);
      return 1;
    }
    Offset += ReadBytes;
  } while (ReadBytes != 0);

  // I've now read the file into memory. To demonstrate:
  std::cout << "[NOTE] Here's the first line "
            << "of the file: \"";
  char *Ptr = Memory;
  while (*Ptr != '\n' && *Ptr != '\0') {
    std::cout << *Ptr;
    ++Ptr;
  }
  std::cout << "\"" << std::endl;

  // Once I'm done with the file, I need to delete the
  // memory I allocated, otherwise this is a memory leak.
  operator delete(Memory);

  // Finally, I need to close the open file descriptor,
  // using the system call close(2).
  if (close(FD) < 0) {
    std::error_code Err(errno, std::generic_category());
    std::cerr << "[ERROR] Could not close file "
              << "descriptor \"" << FD << "\": "
              << Err.message() << std::endl;
    return 1;
  }

  return 0;
}
I can compile and run this program like so:
clang++ read.cpp -o my-read-example
./my-read-example
[NOTE] File size: 2820 bytes
[NOTE] Here's the first line of the file: "#include <cerrno>"
This is a good initial implementation of reading a file into memory in C++. In fact, this is very similar to what the llvm::MemoryBuffer::getFile function does. However, there's room for improvement.
Reading a large file into memory using mmap(2)
Recall that we allocated memory on the heap using operator new, and then used the syscall read(2) to populate that memory with the contents of our file:
read.cpp
  char *Memory = static_cast<char *>(operator new(
      FileSize + 1, std::nothrow));
  Memory[FileSize] = 0;
  ..
  do {
    ReadBytes = read(FD, Memory + Offset, ChunkSize);
    ..
    Offset += ReadBytes;
  } while (ReadBytes != 0);
This allocation would be problematic if we had a huge file to read into memory. A file with a size of 1 gigabyte would result in 1 gigabyte of memory being allocated – that's a lot of RAM!
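Thankfully, the syscall mmap(2) maps the file into the process's address space instead of copying it up front. The operating system then loads pieces of the file on demand, as they are accessed. Once again, Julia Evans explains it best with another great comic: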
I can modify the read.cpp program to use mmap(2) when reading from large files:
read.cpp
  #include <fcntl.h>
+ #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main() {
  ..
+   // For "large" files over 1024 bytes in size, I'll use
+   // syscall mmap(2).
+   char *Memory = nullptr;
+   bool UseMMap = (FileSize > 1024);
+   if (UseMMap) {
+     std::cout << "[NOTE] Using mmap" << std::endl;
+     int ProtectedOptions = PROT_READ;
+     int Flags = MAP_SHARED;
+     Memory = static_cast<char *>(mmap(nullptr, FileSize,
+                                       ProtectedOptions,
+                                       Flags, FD, 0));
+     if (Memory == MAP_FAILED) {
+       std::error_code Err(errno, std::generic_category());
+       std::cerr
+           << "[ERROR] Could not mmap file descriptor \""
+           << FD << "\": " << Err.message() << std::endl;
+       // Without this return, the code below would
+       // dereference the MAP_FAILED pointer.
+       return 1;
+     }
+   } else {
  ..
      // ...use operator new as before.
    }

    // I've now read the file into memory.
+   // Note that this works exactly as before, we
+   // don't have to worry about whether it's an mmap:
    std::cout << "[NOTE] Here's the first line "
              << "of the file: \"";
    char *Ptr = Memory;
    while (*Ptr != '\n' && *Ptr != '\0') {
      std::cout << *Ptr;
      ++Ptr;
    }
    std::cout << "\"" << std::endl;

+   if (UseMMap) {
+     // Once I'm done with the mmap'ed region, I need to
+     // release it.
+     munmap(Memory, FileSize);
+   } else {
      // Once I'm done with the file, I need to delete the
      // memory I allocated, otherwise this is a memory leak.
      operator delete(Memory);
+   }
  ...
    return 0;
  }
Compiling and running this program produces the exact same results as before, with the important distinction that this program can open even very large files, without allocating a ton of memory.
To experiment, you could try adding millions of lines of comments to the bottom of read.cpp. Flip the (FileSize > 1024) conditional to < in order to use operator new, and you'll allocate hundreds of megabytes of memory up front. Then flip it back, to use mmap(2), and you'll allocate almost no memory.
For the most part, llvm::MemoryBuffer works exactly the same way as the read.cpp program above. It has a few extra bells and whistles, too: it works on both Unix and Windows, it uses a more complex heuristic to decide whether to use mmap(2) or not, and it uses some interesting syscalls and options on platforms that support them. I'll explain these as I write about it in detail below.
The LLVM implementation of read.cpp: llvm::MemoryBuffer::getFileOrSTDIN
Swift and Clang both use the llvm::MemoryBuffer::getFileOrSTDIN static member function to open input file arguments passed to them on the command-line. For example, below is the code in libswiftFrontend that converts the string filenames it was passed on the command-line into llvm::MemoryBuffer objects. The filename is a std::string stored as swift::InputFile::file.
swift/lib/Frontend/Frontend.cpp
std::pair<std::unique_ptr<llvm::MemoryBuffer>,
          std::unique_ptr<llvm::MemoryBuffer>>
CompilerInstance::getInputBufferAndModuleDocBufferIfPresent(
    const InputFile &input) {
  ...
  using FileOrError = llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>>;
  FileOrError inputFileOrErr = llvm::MemoryBuffer::getFileOrSTDIN(input.file());
  if (!inputFileOrErr) {
    Diagnostics.diagnose(SourceLoc(), diag::error_open_input_file, input.file(),
                         inputFileOrErr.getError().message());
    return std::make_pair(nullptr, nullptr);
  }
  ...
}
As I wrote in the previous article, these llvm::MemoryBuffer objects will then be passed over to the llvm::SourceMgr, which takes ownership of them. The swift::Parser will then interact with llvm::SourceMgr (or more precisely, a wrapper called swift::SourceManager) in order to emit diagnostics at particular locations in the buffer.
The llvm::MemoryBuffer::getFileOrSTDIN function returns either a std::unique_ptr to an llvm::MemoryBuffer for the given file, or an error. This is represented by the llvm::ErrorOr type. (I'll write more about llvm::ErrorOr in the future, but in the meantime you can watch this 5-minute lightning talk from LLVM Developers Meeting 2016 to learn more about it.)
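To make the "value or error" idea concrete, here is a minimal sketch of such a type. The names ValueOrError and loadBuffer are hypothetical, invented for this illustration. This is not how llvm::ErrorOr is implemented, and it assumes a C++17 compiler for std::variant.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <system_error>
#include <utility>
#include <variant>

// Hypothetical stand-in for an ErrorOr-like type: it holds either a value
// or a std::error_code, never both.
template <typename T>
class ValueOrError {
  std::variant<std::error_code, T> Storage;

public:
  ValueOrError(std::error_code EC) : Storage(EC) {}
  ValueOrError(T Value) : Storage(std::move(Value)) {}

  // True when a value is present, so callers can write `if (!Result)`.
  explicit operator bool() const { return std::holds_alternative<T>(Storage); }

  // Returns the stored error, or a default (success) code if there is none.
  std::error_code getError() const {
    return *this ? std::error_code() : std::get<std::error_code>(Storage);
  }

  T &get() {
    assert(*this && "no value present");
    return std::get<T>(Storage);
  }
};

// Hypothetical loader: fails for an empty name, otherwise pretends to load.
ValueOrError<std::unique_ptr<std::string>> loadBuffer(const std::string &Name) {
  if (Name.empty())
    return std::make_error_code(std::errc::no_such_file_or_directory);
  return std::make_unique<std::string>("contents of " + Name);
}
```

A caller checks the result before touching the value, in the same way the Swift frontend code above checks inputFileOrErr before using it.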
The getFileOrSTDIN function just checks for a file name of "-" and then delegates its logic to either llvm::MemoryBuffer::getSTDIN or getFile. It may optionally be given an int64_t FileSize argument, but if not, the default value of -1 signals the function to find out on its own, just as my example read.cpp program above did, by using the fstat system call.
llvm/include/llvm/Support/MemoryBuffer.h
/// Open the specified file as a MemoryBuffer, or open stdin if the Filename
/// is "-".
static ErrorOr<std::unique_ptr<MemoryBuffer>>
getFileOrSTDIN(const Twine &Filename, int64_t FileSize = -1,
               bool RequiresNullTerminator = true);
llvm/lib/Support/MemoryBuffer.cpp
ErrorOr<std::unique_ptr<MemoryBuffer>>
MemoryBuffer::getFileOrSTDIN(const Twine &Filename, int64_t FileSize,
                             bool RequiresNullTerminator) {
  SmallString<256> NameBuf;
  StringRef NameRef = Filename.toStringRef(NameBuf);

  if (NameRef == "-")
    return getSTDIN();
  return getFile(Filename, FileSize, RequiresNullTerminator);
}
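The same dispatch can be sketched with nothing but the standard library. This is an illustration under my own assumptions rather than LLVM's code: readAll and readFileOrStdin are hypothetical helpers, and the contents are copied into a std::string instead of an llvm::MemoryBuffer.

```cpp
#include <cassert>
#include <fstream>
#include <iostream>
#include <iterator>
#include <optional>
#include <string>

// Reads everything that remains in an already-open stream into a string.
std::string readAll(std::istream &In) {
  return std::string(std::istreambuf_iterator<char>(In),
                     std::istreambuf_iterator<char>());
}

// Hypothetical helper mirroring the dispatch above: a name of "-" means
// standard input, and anything else is treated as a path on disk.
std::optional<std::string> readFileOrStdin(const std::string &Name) {
  if (Name == "-")
    return readAll(std::cin);
  std::ifstream File(Name, std::ios::binary);
  if (!File)
    return std::nullopt; // the file could not be opened
  return readAll(File);
}
```

With a helper like this, a tool can accept either a path or "-" as its input argument, which lets users pipe source text into it.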
I'll focus on the getFile case for now, which delegates in turn to a function called getFileAux. The getFileAux static function implements some of the logic I implemented in the read.cpp example above: it opens the file in order to obtain a file descriptor, it reads that file, and then it calls close(2) in order to close the file descriptor:
llvm/include/llvm/Support/MemoryBuffer.h
/// Open the specified file as a MemoryBuffer, returning a new MemoryBuffer
/// if successful, otherwise returning null. If FileSize is specified, this
/// means that the client knows that the file exists and that it has the
/// specified size.
///
/// \param IsVolatile Set to true to indicate that the contents of the file
/// can change outside the user's control, e.g. when libclang tries to parse
/// while the user is editing/updating the file or if the file is on an NFS.
static ErrorOr<std::unique_ptr<MemoryBuffer>>
getFile(const Twine &Filename, int64_t FileSize = -1,
        bool RequiresNullTerminator = true, bool IsVolatile = false);
llvm/lib/Support/MemoryBuffer.cpp
ErrorOr<std::unique_ptr<MemoryBuffer>>
MemoryBuffer::getFile(const Twine &Filename, int64_t FileSize,
                      bool RequiresNullTerminator, bool IsVolatile) {
  return getFileAux<MemoryBuffer>(Filename, FileSize, FileSize, 0,
                                  RequiresNullTerminator, IsVolatile);
}
...
template <typename MB>
static ErrorOr<std::unique_ptr<MB>>
getFileAux(const Twine &Filename, int64_t FileSize, uint64_t MapSize,
           uint64_t Offset, bool RequiresNullTerminator, bool IsVolatile) {
  int FD;
  std::error_code EC = sys::fs::openFileForRead(Filename, FD);

  if (EC)
    return EC;

  auto Ret = getOpenFileImpl<MB>(FD, Filename, FileSize, MapSize, Offset,
                                 RequiresNullTerminator, IsVolatile);
  close(FD);
  return Ret;
}
Unlike read.cpp, the getFileAux function does not call the open(2) system call directly in order to obtain an open file descriptor for a given filename. Instead, it uses the llvm::sys::fs::openFileForRead function. This LLVM helper function, unlike open(2), works on both Windows and Unix platforms.
Per-platform implementations of system calls in LLVM
The llvm::sys::fs::openFileForRead function has a single declaration, in the header file FileSystem.h:
llvm/include/llvm/Support/FileSystem.h
/// @brief Opens the file with the given name in a read-only mode, returning
/// its open file descriptor.
///
/// @param Name The name of the file to open.
/// @param ResultFD The location to store the descriptor for the opened file.
/// @param RealPath If nonnull, extra work is done to determine the real path
///                 of the opened file, and that path is stored in this
///                 location.
/// @returns errc::success if \a Name has been opened, otherwise a
///          platform-specific error_code.
std::error_code openFileForRead(const Twine &Name, int &ResultFD,
                                SmallVectorImpl<char> *RealPath = nullptr);
But the LLVM codebase defines two separate implementations of this function: one that's used on Windows platforms, and another that's used on Unix. It accomplishes this using CMake.
I've found that a working knowledge of CMake is a gift that really keeps on giving when it comes to compiler development. If you haven't already, you can read about it more in my articles The Swift Compiler's Build System and Reading and Understanding the CMake in apple/swift.
LLVM's root CMakeLists.txt file appends two directories to its modules path, and then includes one file from each of those directories: llvm/cmake/config-ix.cmake and llvm/cmake/modules/HandleLLVMOptions.cmake. Finally, it configures a header file named config.h.cmake:
llvm/CMakeLists.txt
set(CMAKE_MODULE_PATH
  ${CMAKE_MODULE_PATH}
  "${CMAKE_CURRENT_SOURCE_DIR}/cmake"
  "${CMAKE_CURRENT_SOURCE_DIR}/cmake/modules"
  )
...
include(config-ix)
...
include(HandleLLVMOptions)
...
configure_file(
  ${LLVM_MAIN_INCLUDE_DIR}/llvm/Config/config.h.cmake
  ${LLVM_INCLUDE_DIR}/llvm/Config/config.h)
The config-ix.cmake file uses the built-in CMake function check_symbol_exists in order to determine which system calls are available in the target build environment. For example, it checks whether pread is available and, if it is, has CMake define a variable named HAVE_PREAD:
llvm/cmake/config-ix.cmake
check_symbol_exists(pread unistd.h HAVE_PREAD)
Then, in HandleLLVMOptions.cmake, it uses the built-in CMake platform variables, WIN32 and UNIX, to set the CMake variables LLVM_ON_WIN32 and LLVM_ON_UNIX to 1 or 0:
llvm/cmake/modules/HandleLLVMOptions.cmake
if(WIN32)
  ...
  set(LLVM_ON_WIN32 1)
  set(LLVM_ON_UNIX 0)
  ...
else(WIN32)
  if(UNIX)
    set(LLVM_ON_WIN32 0)
    set(LLVM_ON_UNIX 1)
    ...
endif(WIN32)
At this point, CMake variables like HAVE_PREAD and LLVM_ON_UNIX would only be visible from within CMake. To make their values visible in C++, the config.h.cmake file is configured via a call to the CMake built-in function configure_file, as shown in a code snippet above. The config.h.cmake file is full of #cmakedefine directives, which configure_file transforms into #define statements for consumption in C++. For example, config.h.cmake contains these #cmakedefine statements…
llvm/include/llvm/Config/config.h.cmake
/* Define to 1 if you have the `pread' function. */
#cmakedefine HAVE_PREAD ${HAVE_PREAD}
...
/* Define if this is Unixish platform */
#cmakedefine LLVM_ON_UNIX ${LLVM_ON_UNIX}

/* Define if this is Win32ish platform */
#cmakedefine LLVM_ON_WIN32 ${LLVM_ON_WIN32}
…which on a Unix-like platform, such as macOS, are transformed into these statements, placed in a file in the build directory named include/llvm/Config/config.h:
build/include/llvm/Config/config.h
/* Define to 1 if you have the `pread' function. */
#define HAVE_PREAD 1
...
/* Define if this is Unixish platform */
#define LLVM_ON_UNIX 1
And in llvm/lib/Support/Path.cpp, instead of an implementation of the llvm::sys::fs::openFileForRead function, there's a conditional include based on these definitions:
llvm/lib/Support/Path.cpp
// Include the truly platform-specific parts.
#if defined(LLVM_ON_UNIX)
#include "Unix/Path.inc"
#endif
#if defined(LLVM_ON_WIN32)
#include "Windows/Path.inc"
#endif
It's in the included llvm/lib/Support/Unix/Path.inc file that I can find the actual implementation of llvm::sys::fs::openFileForRead that's used on Unix platforms.
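The pattern of one declaration with a per-platform definition selected by the preprocessor can be sketched in a single file. Everything here is an assumption made for illustration: the SKETCH_ON_UNIX and SKETCH_ON_WIN32 macros stand in for what a generated config.h would provide, and pathSeparator is a hypothetical function, not part of LLVM.

```cpp
#include <cassert>
#include <string>

// Stand-in for a generated config.h. A real build would get these macros
// from configure_file; here they are derived from the compiler's own _WIN32.
#if defined(_WIN32)
#define SKETCH_ON_WIN32 1
#else
#define SKETCH_ON_UNIX 1
#endif

// One declaration, shared by every platform.
std::string pathSeparator();

// The platform-specific parts. LLVM pulls these in by including a .inc
// file; they are written inline here to keep the sketch in one file.
#if defined(SKETCH_ON_UNIX)
std::string pathSeparator() { return "/"; }
#endif
#if defined(SKETCH_ON_WIN32)
std::string pathSeparator() { return "\\"; }
#endif
```

Only one of the two definitions survives preprocessing, so callers see a single ordinary function. Note that when a CMake variable is unset, configure_file writes a commented-out #undef instead of a #define, which is why checks using defined(...) work.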
Opening a file on Unix
As in the read.cpp example at the beginning of this article, the Unix implementation of the llvm::sys::fs::openFileForRead function uses the system call open(2) in order to open a file and get its file descriptor:
llvm/lib/Support/Unix/Path.inc
std::error_code openFileForRead(const Twine &Name, int &ResultFD,
                                SmallVectorImpl<char> *RealPath) {
  SmallString<128> Storage;
  StringRef P = Name.toNullTerminatedStringRef(Storage);
  int OpenFlags = O_RDONLY;
#ifdef O_CLOEXEC
  OpenFlags |= O_CLOEXEC;
#endif
  if ((ResultFD = sys::RetryAfterSignal(-1, open, P.begin(), OpenFlags)) < 0)
    return std::error_code(errno, std::generic_category());
#ifndef O_CLOEXEC
  int r = fcntl(ResultFD, F_SETFD, FD_CLOEXEC);
  (void)r;
  assert(r == 0 && "fcntl(F_SETFD, FD_CLOEXEC) failed");
#endif
  ...
  return std::error_code();
}
The implementation above is long-winded because of two pieces of Unix trivia.
First off, instead of calling open(2) directly, it calls llvm::sys::RetryAfterSignal, which invokes open(2) in a while loop. This loop retries the open(2) call if it fails with an EINTR error code:
llvm/include/llvm/Support/Errno.h
template <typename FailT, typename Fun, typename... Args>
inline auto RetryAfterSignal(const FailT &Fail, const Fun &F,
                             const Args &... As) -> decltype(F(As...)) {
  decltype(F(As...)) Res;
  do
    Res = F(As...);
  while (Res == Fail && errno == EINTR);
  return Res;
}
I'm not a C++ expert. In case you aren't either, allow me to offer an explanation for the templates being used in the code above.
The RetryAfterSignal function has three template parameters: FailT, Fun, and the parameter pack Args. Each one supplies the type of a function parameter:
- const FailT &Fail, representing a value returned when the function call fails.
- const Fun &F, representing the callable function.
- A function parameter pack const Args &... As, representing the arguments passed to function F.
RetryAfterSignal uses the trailing return type syntax, of the form auto function -> return_type. Its return type is specified as decltype(F(As...)). In other words, the return type is the type returned by the expression F(As...).
To map this all to the concrete example we were looking at in llvm::sys::fs::openFileForRead, recall that function had the expression sys::RetryAfterSignal(-1, open, P.begin(), OpenFlags). Here -1 is the failure value const FailT &Fail, open is the function value const Fun &F, and (P.begin(), OpenFlags) are the parameter pack arguments passed into the open function. The return type is the type returned by open(P.begin(), OpenFlags), which is int.
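To watch the retry loop in action without waiting for a real signal, the sketch below reproduces the helper under a lowercase name and calls it with a hypothetical function, flakyOpen, that pretends to be interrupted twice before succeeding. Both retryAfterSignal (as spelled here) and flakyOpen are names made up for this illustration.

```cpp
#include <cassert>
#include <cerrno>

// Same shape as the LLVM helper shown above, reproduced so that the sketch
// compiles on its own.
template <typename FailT, typename Fun, typename... Args>
auto retryAfterSignal(const FailT &Fail, const Fun &F, const Args &... As)
    -> decltype(F(As...)) {
  decltype(F(As...)) Res;
  do
    Res = F(As...);
  while (Res == Fail && errno == EINTR);
  return Res;
}

// Counts how many times flakyOpen has been called.
static int CallCount = 0;

// Hypothetical flaky function: reports EINTR on its first two calls, then
// succeeds and returns its argument.
int flakyOpen(int Value) {
  ++CallCount;
  if (CallCount <= 2) {
    errno = EINTR;
    return -1;
  }
  return Value;
}
```

Calling retryAfterSignal(-1, flakyOpen, 42) deduces FailT as int, Fun as the function type of flakyOpen, and Args as a pack containing one int. The loop invokes flakyOpen three times and hands back the value from the third call.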
The llvm::sys::RetryAfterSignal function ignores the EINTR and retries because "blocking" Unix functions like open(2) and read(2) can fail with EINTR when they are interrupted by a Unix signal before they complete. Interruptions like this can occur for all sorts of reasons, some of which you can read more about here. In these cases, LLVM will simply try again.
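The other quirk in the llvm::sys::fs::openFileForRead implementation is the check for O_CLOEXEC, an open(2) flag that was added in Linux 2.6.23 and later standardized in POSIX.1-2008, so it may be missing on older systems. This option marks the file descriptor as close-on-exec: the OS automatically closes it when the process calls one of the exec functions, so the descriptor doesn't leak into a program launched after a fork. If the flag is not available, the implementation uses the syscall fcntl in order to set the equivalent FD_CLOEXEC flag after the file has been opened.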
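The effect of the flag can be observed from within a program by asking fcntl for the descriptor's flags. The following is a small POSIX-only sketch; openCloseOnExec and isCloseOnExec are hypothetical helpers written for this illustration, not LLVM functions.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Opens Path read-only with the close-on-exec flag set, either atomically
// via O_CLOEXEC or, where that flag is missing, with a follow-up fcntl.
// Returns -1 on failure.
int openCloseOnExec(const char *Path) {
  int Flags = O_RDONLY;
#ifdef O_CLOEXEC
  Flags |= O_CLOEXEC;
#endif
  int FD = open(Path, Flags);
  if (FD < 0)
    return -1;
#ifndef O_CLOEXEC
  if (fcntl(FD, F_SETFD, FD_CLOEXEC) < 0) {
    close(FD);
    return -1;
  }
#endif
  return FD;
}

// Reports whether the descriptor will be closed by a later exec call.
bool isCloseOnExec(int FD) {
  int DescriptorFlags = fcntl(FD, F_GETFD);
  return DescriptorFlags >= 0 && (DescriptorFlags & FD_CLOEXEC) != 0;
}
```

A descriptor returned by openCloseOnExec reports true from isCloseOnExec, while one opened with a plain open(Path, O_RDONLY) reports false. Setting the flag as part of open(2) also avoids a window in which another thread could fork and exec between the open and the fcntl.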
Reading the file into an llvm::WritableMemoryBuffer
The llvm::sys::fs::openFileForRead function opens a file and returns its file descriptor. Then control is returned back to the getFileAux function, which passes the open descriptor into the getOpenFileImpl static function:
llvm/lib/Support/MemoryBuffer.cpp
template <typename MB>
static ErrorOr<std::unique_ptr<MB>>
getFileAux(const Twine &Filename, int64_t FileSize, uint64_t MapSize,
           uint64_t Offset, bool RequiresNullTerminator, bool IsVolatile) {
  int FD;
  std::error_code EC = sys::fs::openFileForRead(Filename, FD);

  if (EC)
    return EC;

  auto Ret = getOpenFileImpl<MB>(FD, Filename, FileSize, MapSize, Offset,
                                 RequiresNullTerminator, IsVolatile);
  close(FD);
  return Ret;
}
The getOpenFileImpl function implements the same logic the read.cpp example at the beginning of this article did. If the file's size was not provided, it finds out how large the file is by calling llvm::sys::fs::status, which on Unix calls fstat. It then makes a decision as to whether to use mmap(2) or to allocate memory up front using operator new. If it allocates memory, then it uses the system call read(2) (or pread, if HAVE_PREAD is defined) in order to read the bytes of the file into memory:
llvm/lib/Support/MemoryBuffer.cpp
template <typename MB>
static ErrorOr<std::unique_ptr<MB>>
getOpenFileImpl(int FD, const Twine &Filename, uint64_t FileSize,
                uint64_t MapSize, int64_t Offset, bool RequiresNullTerminator,
                bool IsVolatile) {
  static int PageSize = sys::Process::getPageSize();

  // Default is to map the full file.
  if (MapSize == uint64_t(-1)) {
    // If we don't know the file size, use fstat to find out.  fstat on an open
    // file descriptor is cheaper than stat on a random path.
    if (FileSize == uint64_t(-1)) {
      sys::fs::file_status Status;
      std::error_code EC = sys::fs::status(FD, Status);
      if (EC)
        return EC;
      ...
      FileSize = Status.getSize();
    }
    MapSize = FileSize;
  }

  if (shouldUseMmap(FD, FileSize, MapSize, Offset, RequiresNullTerminator,
                    PageSize, IsVolatile)) {
    std::error_code EC;
    std::unique_ptr<MB> Result(
        new (NamedBufferAlloc(Filename)) MemoryBufferMMapFile<MB>(
            RequiresNullTerminator, FD, MapSize, Offset, EC));
    if (!EC)
      return std::move(Result);
  }

  auto Buf = WritableMemoryBuffer::getNewUninitMemBuffer(MapSize, Filename);
  if (!Buf) {
    // Failed to create a buffer. The only way it can fail is if
    // new(std::nothrow) returns 0.
    return make_error_code(errc::not_enough_memory);
  }

  char *BufPtr = Buf.get()->getBufferStart();

  size_t BytesLeft = MapSize;
#ifndef HAVE_PREAD
  if (lseek(FD, Offset, SEEK_SET) == -1)
    return std::error_code(errno, std::generic_category());
#endif

  while (BytesLeft) {
#ifdef HAVE_PREAD
    ssize_t NumRead = sys::RetryAfterSignal(-1, ::pread, FD, BufPtr, BytesLeft,
                                            MapSize - BytesLeft + Offset);
#else
    ssize_t NumRead = sys::RetryAfterSignal(-1, ::read, FD, BufPtr, BytesLeft);
#endif
    if (NumRead == -1) {
      // Error while reading.
      return std::error_code(errno, std::generic_category());
    }
    if (NumRead == 0) {
      memset(BufPtr, 0, BytesLeft); // zero-initialize rest of the buffer.
      break;
    }
    BytesLeft -= NumRead;
    BufPtr += NumRead;
  }

  return std::move(Buf);
}
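The read loop at the end of that function is worth isolating, because it handles two things the read.cpp example did not: short reads and reads at an explicit offset. Below is a simplified, POSIX-only sketch of the same idea. The readRange and readFileRange helpers are hypothetical, they copy into a std::string rather than a memory buffer, and they assume pread is available.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <sys/types.h>
#include <unistd.h>

// Reads up to Size bytes starting at Offset, without moving the descriptor's
// file position. Retries on EINTR and copes with short reads. Returns false
// on error; on success Out holds the bytes that were actually available.
bool readRange(int FD, off_t Offset, std::size_t Size, std::string &Out) {
  Out.assign(Size, '\0');
  std::size_t Done = 0;
  while (Done < Size) {
    ssize_t NumRead =
        pread(FD, &Out[Done], Size - Done, Offset + static_cast<off_t>(Done));
    if (NumRead < 0) {
      if (errno == EINTR)
        continue; // interrupted by a signal, so try again
      return false;
    }
    if (NumRead == 0)
      break; // end of file reached before Size bytes were read
    Done += static_cast<std::size_t>(NumRead);
  }
  Out.resize(Done);
  return true;
}

// Convenience wrapper: opens Path, reads a range, and closes the file.
// Returns an empty string if anything fails.
std::string readFileRange(const char *Path, off_t Offset, std::size_t Size) {
  int FD = open(Path, O_RDONLY);
  if (FD < 0)
    return std::string();
  std::string Out;
  if (!readRange(FD, Offset, Size, Out))
    Out.clear();
  close(FD);
  return Out;
}
```

Because pread takes the offset as an argument, no lseek call is needed beforehand, which is the same reason the LLVM code only calls lseek when HAVE_PREAD is not defined.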
The functions llvm::sys::Process::getPageSize and llvm::sys::fs::status above use the same CMake tricks as llvm::sys::fs::openFileForRead did in order to include a platform-specific implementation: getPageSize is implemented in llvm/lib/Support/Unix/Process.inc and Windows/Process.inc, and status is implemented in Unix/Path.inc and Windows/Path.inc. On Unix they use system calls getpagesize and fstat in order to get the information they need from the operating system.
The code above instantiates either an llvm::MemoryBufferMMapFile or an llvm::WritableMemoryBuffer based on whether the helper function shouldUseMmap returns true or false. As it was in the read.cpp example at the beginning of this article, one criterion for that decision is the size of the file: if it's smaller than a page on the system, or smaller than 16 kilobytes (4 * 4096 bytes), then mmap(2) is not used:
llvm/lib/Support/MemoryBuffer.cpp
static bool shouldUseMmap(int FD,
                          size_t FileSize,
                          size_t MapSize,
                          off_t Offset,
                          bool RequiresNullTerminator,
                          int PageSize,
                          bool IsVolatile) {
  ...
  // We don't use mmap for small files because this can severely fragment our
  // address space.
  if (MapSize < 4 * 4096 || MapSize < (unsigned)PageSize)
    return false;
  ...
  return true;
}
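The size check on its own is easy to try out. The sketch below extracts only that one condition into a hypothetical function named largeEnoughForMmap. The real shouldUseMmap applies further checks that are elided in the excerpt above, so a true result here does not mean LLVM would actually map the file.

```cpp
#include <cassert>
#include <cstddef>

// Simplified version of the size check only: a file qualifies for mapping
// when it is at least 16 KiB long and at least one page long.
bool largeEnoughForMmap(std::size_t MapSize, std::size_t PageSize) {
  const std::size_t MinMapSize = 4 * 4096; // 16 KiB
  return MapSize >= MinMapSize && MapSize >= PageSize;
}
```

By this measure the 2820-byte read.cpp from earlier would be read with read(2) or pread rather than mapped, whatever the page size. On a system with 64 KiB pages, a 20000-byte file would not qualify either, because it is smaller than a single page.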
Assuming mmap(2) is not used, then the getOpenFileImpl function calls the static function llvm::WritableMemoryBuffer::getNewUninitMemBuffer. This function allocates the buffer memory just as the read.cpp example did, by using operator new. Unlike the read.cpp example program, however, this function not only allocates memory for a buffer to store the file's contents, it also allocates space for an instance of the llvm::MemoryBuffer class, and for the name of the file:
llvm/lib/Support/MemoryBuffer.cpp
std::unique_ptr<WritableMemoryBuffer>
WritableMemoryBuffer::getNewUninitMemBuffer(size_t Size, const Twine &BufferName) {
  using MemBuffer = MemoryBufferMem<WritableMemoryBuffer>;
  // Allocate space for the MemoryBuffer, the data and the name. It is important
  // that MemoryBuffer and data are aligned so PointerIntPair works with them.
  ...
  SmallString<256> NameBuf;
  StringRef NameRef = BufferName.toStringRef(NameBuf);
  size_t AlignedStringLen = alignTo(sizeof(MemBuffer) + NameRef.size() + 1, 16);
  size_t RealLen = AlignedStringLen + Size + 1;
  char *Mem = static_cast<char*>(operator new(RealLen, std::nothrow));
  if (!Mem)
    return nullptr;

  // The name is stored after the class itself.
  CopyStringRef(Mem + sizeof(MemBuffer), NameRef);

  // The buffer begins after the name and must be aligned.
  char *Buf = Mem + AlignedStringLen;
  Buf[Size] = 0; // Null terminate buffer.

  auto *Ret = new (Mem) MemBuffer(StringRef(Buf, Size), true);
  return std::unique_ptr<WritableMemoryBuffer>(Ret);
}
Based on the code above, I can see that the memory that's being allocated here is laid out in three distinct segments:
- The first segment of memory allocated is sized such that an instance of llvm::MemoryBufferMem<llvm::WritableMemoryBuffer> could fit within it. Note that the size is calculated using sizeof(MemBuffer), and then the memory buffer is instantiated by calling new (Mem) MemBuffer(...). As I mentioned in my article on Getting Started with the Swift Frontend: Lexing & Parsing, this is a "placement" new operator call. It doesn't allocate any memory. Instead, it runs the MemBuffer constructor directly on the chunk of memory Mem, so the instance is constructed in place. (You can read more about "placement new" here.)
- The second segment of memory stores the name of the buffer. It's sized using the call to NameRef.size() above, plus one byte for a null terminator, and then the name is copied by calling the static helper function CopyStringRef. The combined size of the first two segments is rounded up to a multiple of 16 by alignTo, so that the next segment starts at an aligned offset.
- Finally comes the rest of the buffer, which is the same size as the file being read into it, plus one byte for a null terminator.
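The three-segment layout described above can be reproduced in a small, self-contained sketch. TailBuffer is a hypothetical class written for this illustration. It copies the text it is given instead of leaving the data uninitialized, and it is not LLVM's MemoryBufferMem.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <new>
#include <string>

// Hypothetical header object. The name and the data live in the same
// allocation, directly after this object.
class TailBuffer {
  const char *Data;
  std::size_t Size;

  TailBuffer(const char *Data, std::size_t Size) : Data(Data), Size(Size) {}

  static std::size_t alignTo16(std::size_t N) { return (N + 15) / 16 * 16; }

public:
  // Lays out [TailBuffer][name + NUL][padding][text + NUL] in one block.
  static TailBuffer *create(const std::string &Name, const std::string &Text) {
    std::size_t DataOffset = alignTo16(sizeof(TailBuffer) + Name.size() + 1);
    std::size_t RealLen = DataOffset + Text.size() + 1;
    char *Mem = static_cast<char *>(operator new(RealLen, std::nothrow));
    if (!Mem)
      return nullptr;
    // The name is stored right after the object, including its terminator.
    std::memcpy(Mem + sizeof(TailBuffer), Name.c_str(), Name.size() + 1);
    // The data starts at the next multiple of 16 bytes from the start.
    char *Buf = Mem + DataOffset;
    std::memcpy(Buf, Text.data(), Text.size());
    Buf[Text.size()] = '\0';
    // Placement new: construct the object at Mem without allocating again.
    return new (Mem) TailBuffer(Buf, Text.size());
  }

  // Runs the destructor, then frees the single underlying allocation.
  static void destroy(TailBuffer *B) {
    if (!B)
      return;
    B->~TailBuffer();
    operator delete(static_cast<void *>(B));
  }

  const char *name() const {
    return reinterpret_cast<const char *>(this) + sizeof(TailBuffer);
  }
  const char *data() const { return Data; }
  std::size_t size() const { return Size; }
};
```

One allocation serves the object, its name, and its contents, so there is a single operator new call to check for failure and a single operator delete call to release everything.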
The memory buffer allocated and returned by the llvm::WritableMemoryBuffer::getNewUninitMemBuffer function is an llvm::MemoryBufferMem<llvm::WritableMemoryBuffer>. MemoryBufferMem<T> is defined as a subclass of T. In this case, T is an llvm::WritableMemoryBuffer, which in turn derives from llvm::MemoryBuffer. The constructor of MemoryBufferMem calls through to llvm::MemoryBuffer::init:
llvm/lib/Support/MemoryBuffer.cpp
/// MemoryBufferMem - Named MemoryBuffer pointing to a block of memory.
template<typename MB>
class MemoryBufferMem : public MB {
public:
  MemoryBufferMem(StringRef InputData, bool RequiresNullTerminator) {
    MemoryBuffer::init(InputData.begin(), InputData.end(),
                       RequiresNullTerminator);
  }

  /// Disable sized deallocation for MemoryBufferMem, because it has
  /// tail-allocated data.
  void operator delete(void *p) { ::operator delete(p); }
  ...
};
And the llvm::MemoryBuffer::init function simply sets private members pointing to the beginning and end of the buffer:
llvm/include/llvm/Support/MemoryBuffer.h
class MemoryBuffer {
  const char *BufferStart; // Start of the buffer.
  const char *BufferEnd;   // End of the buffer.
  ..
};
llvm/lib/Support/MemoryBuffer.cpp
/// init - Initialize this MemoryBuffer as a reference to externally allocated
/// memory, memory that we know is already null terminated.
void MemoryBuffer::init(const char *BufStart, const char *BufEnd,
                        bool RequiresNullTerminator) {
  assert((!RequiresNullTerminator || BufEnd[0] == 0) &&
         "Buffer is not null terminated!");
  BufferStart = BufStart;
  BufferEnd = BufEnd;
}
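The consequence of storing only a start and an end pointer is that the class is a view: it describes bytes that something else owns. A minimal sketch of that idea follows, using a hypothetical BufferView class and assuming C++17 for std::string_view.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical non-owning view: it records where a buffer starts and ends,
// and leaves ownership of the bytes to someone else.
class BufferView {
  const char *Start = nullptr;
  const char *End = nullptr;

public:
  void init(const char *BufStart, const char *BufEnd,
            bool RequiresNullTerminator) {
    // BufEnd points one past the contents, so a terminator sits exactly there.
    assert((!RequiresNullTerminator || *BufEnd == '\0') &&
           "Buffer is not null terminated!");
    Start = BufStart;
    End = BufEnd;
  }

  std::size_t size() const { return static_cast<std::size_t>(End - Start); }
  std::string_view text() const { return std::string_view(Start, size()); }
};
```

Note that the null terminator is not counted in size(). It lives one byte past the end pointer, which is why both allocation paths reserve an extra byte beyond the file's length.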
In summary, on a Unix system:
- The llvm::MemoryBuffer::getFileOrSTDIN static function checks whether it's been given a filename of "-" and, if it has, calls llvm::MemoryBuffer::getSTDIN. Otherwise, it calls llvm::MemoryBuffer::getFile.
- llvm::MemoryBuffer::getFile calls through to getFileAux.
- getFileAux gets an open file descriptor by calling llvm::sys::fs::openFileForRead, then getOpenFileImpl to instantiate a new llvm::MemoryBuffer and read in the contents of the file, and finally close(2) in order to close the file descriptor.
- getOpenFileImpl checks the file size and determines whether to use mmap(2). If mmap(2) is not used, then getOpenFileImpl allocates memory for an llvm::MemoryBuffer, its name, and its contents. It then reads in the contents of the file using read(2) or pread, depending on what's available on the operating system.
Mapping the file into an llvm::MemoryBufferMMapFile
Recall that getOpenFileImpl instantiates an llvm::MemoryBufferMMapFile if shouldUseMmap returns true:
llvm/lib/Support/MemoryBuffer.cpp
template <typename MB>
static ErrorOr<std::unique_ptr<MB>>
getOpenFileImpl(int FD, const Twine &Filename, uint64_t FileSize,
                uint64_t MapSize, int64_t Offset, bool RequiresNullTerminator,
                bool IsVolatile) {
  ...
  if (shouldUseMmap(FD, FileSize, MapSize, Offset, RequiresNullTerminator,
                    PageSize, IsVolatile)) {
    std::error_code EC;
    std::unique_ptr<MB> Result(
        new (NamedBufferAlloc(Filename)) MemoryBufferMMapFile<MB>(
            RequiresNullTerminator, FD, MapSize, Offset, EC));
    if (!EC)
      return std::move(Result);
  }

  auto Buf = WritableMemoryBuffer::getNewUninitMemBuffer(MapSize, Filename);
  ...
  return std::move(Buf);
}
The llvm::MemoryBufferMMapFile class makes use of the llvm::sys::fs::mapped_file_region class, a wrapper around the mmap(2) and munmap system calls:
llvm/lib/Support/MemoryBuffer.cpp
/// \brief Memory maps a file descriptor using sys::fs::mapped_file_region.
///
/// This handles converting the offset into a legal offset on the platform.
template<typename MB>
class MemoryBufferMMapFile : public MB {
  sys::fs::mapped_file_region MFR;
  ...
public:
  MemoryBufferMMapFile(bool RequiresNullTerminator, int FD, uint64_t Len,
                       uint64_t Offset, std::error_code &EC)
      : MFR(FD, MB::Mapmode, getLegalMapSize(Len, Offset),
            getLegalMapOffset(Offset), EC) {
    if (!EC) {
      const char *Start = getStart(Len, Offset);
      MemoryBuffer::init(Start, Start + Len, RequiresNullTerminator);
    }
  }
  ...
};
The mapped_file_region constructor calls mapped_file_region::init, which calls mmap(2). Its destructor calls munmap:
llvm/lib/Support/Unix/Path.inc
std::error_code mapped_file_region::init(int FD, uint64_t Offset,
                                         mapmode Mode) {
  ...
  Mapping = ::mmap(nullptr, Size, prot, flags, FD, Offset);
  if (Mapping == MAP_FAILED)
    return std::error_code(errno, std::generic_category());
  return std::error_code();
}

mapped_file_region::mapped_file_region(int fd, mapmode mode, size_t length,
                                       uint64_t offset, std::error_code &ec)
    : Size(length), Mapping(), FD(fd), Mode(mode) {
  ...
  ec = init(fd, offset, mode);
  if (ec)
    Mapping = nullptr;
}

mapped_file_region::~mapped_file_region() {
  if (Mapping)
    ::munmap(Mapping, Size);
}
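Tying mmap(2) to a constructor and munmap to a destructor is a pattern that can be sketched outside of LLVM as well. The MappedFile class below is a hypothetical, POSIX-only illustration that always maps the whole file read-only. It is much simpler than mapped_file_region, which also handles offsets and different map modes.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <string_view>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical RAII wrapper: maps a whole file read-only in the constructor
// and unmaps it in the destructor.
class MappedFile {
  void *Mapping = nullptr;
  std::size_t Size = 0;

public:
  explicit MappedFile(const char *Path) {
    int FD = open(Path, O_RDONLY);
    if (FD < 0)
      return;
    struct stat Stat;
    // Empty files are skipped, because a zero-length mapping is an error.
    if (fstat(FD, &Stat) == 0 && Stat.st_size > 0) {
      std::size_t Length = static_cast<std::size_t>(Stat.st_size);
      void *Result = mmap(nullptr, Length, PROT_READ, MAP_PRIVATE, FD, 0);
      if (Result != MAP_FAILED) {
        Mapping = Result;
        Size = Length;
      }
    }
    // The mapping stays valid after the descriptor is closed.
    close(FD);
  }

  ~MappedFile() {
    if (Mapping)
      munmap(Mapping, Size);
  }

  // Copying would lead to the same region being unmapped twice.
  MappedFile(const MappedFile &) = delete;
  MappedFile &operator=(const MappedFile &) = delete;

  bool valid() const { return Mapping != nullptr; }

  std::string_view contents() const {
    return valid() ? std::string_view(static_cast<const char *>(Mapping), Size)
                   : std::string_view();
  }
};
```

Since the mapping outlives the file descriptor, the descriptor can be closed as soon as mmap returns. That matches getFileAux above, which calls close on the descriptor right after getOpenFileImpl returns, even when the result is an mmap-backed buffer.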
What I learned
Looking into llvm::MemoryBuffer and how LLVM reads source files into memory taught me a lot:
- At build time LLVM's CMake code determines which platform it's being built for. Based on this, it includes Unix- or Windows-specific implementations, such as llvm/lib/Support/Unix/Path.inc or Windows/Path.inc.
- Also at build time LLVM's CMake code determines which system calls are available on the target platform. For example, if pread is available, then getOpenFileImpl will use pread to read the file into an llvm::WritableMemoryBuffer, instead of read(2).
- I can use mmap(2) to access the contents of a very large file without allocating a large amount of memory. LLVM's shouldUseMmap function references the file size, among other characteristics, to determine whether to use pre-allocated memory with llvm::WritableMemoryBuffer, or mmap(2) with llvm::MemoryBufferMMapFile.
- llvm::MemoryBuffer maintains a buffer for the contents of a source file as a "trailing object": a block of memory that is allocated together with the class instance, directly after it, but is not a member of the class itself. LLVM uses this trailing object pattern extensively. (It even defines an llvm::TrailingObjects class template, which I plan on writing more about in the future.)