Episode 0: `mov fp, sp`

00:00:00 Intro

Website: tlbh.it
Twitter: @tlbhit
This episode on Apple podcast
The stack pretty much always TLB hits!

00:00:59 Disclaimer

We're lifelong learners, only know so much!
Will put errata up on the TLB Hit website
"Sidechannels" via Twitter

00:01:42 What's the stack?

Episode is named mov fp sp
mov fp sp in the prologue of functions
Epilogue has "reverse" mov sp fp
Instructions that manipulate the stack!
Compiler spills values that registers can't hold onto the stack
Functions do it a lot -- have their own [local] state, call functions that have their own state
Because subroutines can recurse without bounds would need unbounded number of registers
Often different kinds of registers: arithmetic value registers, floating point registers
Registers contain fairly arbitrary "stuff": pointers to data, pointers to code, return addresses, etc.
Stack is usually contiguous and allocated on a per-thread basis
Idea of "GPRs": general purpose registers, though some machines have dedicated registers for floating point values as well, or SIMD for really wide
Prologue moves stack pointer to base pointer, epilogue moves base pointer back to stack pointer, "undoing", locally manipulating the stack pointer then rolling things back to where they previously were

00:03:50 Mechanisms in the processor

Frame pointer/base pointer (bp/fp), stack pointer (sp)
Usual convention is that the frame pointer doesn't change during the course of the function's execution
Generated code addresses "slots" relative (at offsets from) the frame pointer; e.g. +4, +8, etc.
Stack is kind of like a linked list! Pointer of the stack that says "this is where the frame pointer used to be before we came into this routine".

00:05:10 Comparison to an abstract stack machine

In CS class you may learn about machines where you push two operands onto a stack then do an add operation that consumes the top two things on the stack
Compare to traditional processor we use today: expanding the stack as a single operation that makes a bunch of slots at once
The slots don't need to be consumed in a strictly stack-order fashion
Distinction of "stack machine" vs scratchpad-area style frame areas that happen in stack-like fashion for subroutine calls

00:06:05 Some instruction set considerations

Considerations on modern machines for frequency of these operations and how they fit in our instruction cache; e.g. on x86 push/pop are single byte opcodes
On ARM we may have a "push multiple values" instruction; little CISC-y but you do so commonly it may make some sense
ARMv7 had instruction allowed to push 16 registers (all GPRs) and increment stack pointer. Yay RISC!

00:07:09 Compiler optimizations and stackiness

By moving things onto the stack -- code is constantly working with the things in its stack frame
Locality, but also avoiding memory allocation subroutines (100s or 1000s of cycle depending)
In scratchpad area values are tracked precisely in dataflow sort of style
Bring them "in" to the compiler, values becomes more trackable
SSA values vs arbitrary memory references
When structs are brought onto the stack the individual fields inside can be broken apart and the component fields can be tracked as individual values
Often called "scalar replacement of aggregates" (e.g. in LLVM)
When we home them on the stack we can do our common optimizations, CSE, DCE; if on the heap, may be a lot harder to to do
In managed languages (e.g. JavaScript, Java) would do escape analysis to show it doesn't escape via heap to an unknown subroutine -- once placed on the stack you can eliminate whole objects and just track sub-fields inside of it
Allows you to just "explode" the object itself and think about its component fields individually and get rid of whatever doesn't matter in there

00:09:17 Eliding heap allocations in C++

Some compilers can also sometimes optimize local heap allocations, turn them into stacky allocation
C++ explicitly allows you to do that as of a few years ago, Clang does that
If you new an object no guarantee that you're actually going to put it on the heap / call the underlying allocator
Can be surprising to people -- can do SRoA, other stuff, might get rid of the entire computation
Neat, unless it's not what you're trying to do
But seems like a key optimization to do
If you're thinking about things as objects instead of raw bytes having higher level understanding you can optimize based off of is pretty key it seems?

00:10:18 Frame pointer omission

When JF started programming there was "frame pointer omission" (FPO) which was cool because optmizers weren't as good as they are now
Back when you only had 8 registers for x86 the extra register could go a long way potentially -- stack is hot in cache but doing stores and loads to memory locations
Was known to some as "that flag that makes the debugger way worse" -- debug information has to be a lot more prescriptive when you can't simply describe where things are as an offset from a canonical (assumed unchanging) register
Modern CPUs doing register renaming under the hood against a much bigger micro-architectural register set -- not as worried about saving that one register as much of the time -- although in hot code you still might

00:11:49 "Leaf" functions

When you inline things you make bigger regions for analysis, ideally make big fat leaf functions
How much of program time is generally spent in leaf functions over some set of applications?
Function at the end of the call tree
If your subroutine doesn't call any other subroutines that's a nice property, because now you know that everything at the end of the stack belongs to you, you're just doing your work and popping back up to whoever called you
Inlining really unlocks power of leaf -- inlining into non-leaf-functions can make them become the leaf
So long as you don't over-inline and the working set doesn't become too big -- the compiler can know everything it does and have a good amount of work to do
Small region in which you can analyze everything, like tiny little whole program analysis

00:13:10 Why do we have a stack again?

Why can't we inline everything?
Two main issues: 1) don't necessarily know call graph for the whole program 2) recursion
If you knew where all the calls went (virtual/indirect/etc in your translation unit and other ones in your program), and without recursion, you wouldn't need a stack, you know a perfect call graph
For some of these you could avoid having a stack -- virtual functions but only a few actually implementations of it, could change to test-and-branch
If you have a fully analyzable virtual dispatch it effectively just becomes a switch, can potentially inline what the targets are
Control flow analysis takes indirect branch that can go anywhere and enumerate the real set of possibilities (devirtualization within a translation unit)
Fully analyzeable call graph is an interesting computer history topic: FORTRAN77 classically able to do this (programs were restricted enough you could analyze it)
XLA ML/array programming machine learning compiler has the same property where the whole call graph is analyzeable so you can create a slab that's the giant frame for the whole program you're optimizing and all allocations are known-fixed size
Whole program call graph analyzeability lives on in these niche use cases!
In stark contrast, sometimes we need multiple different kinds of stacks at the same time!
The JS engine would sometimes recur from JS calls through the VM runtime to other JS code, and that would need to potentially create a sub-stack (!) -- multi stack problems exist beyond even just needing to analyze/manage a single stack
Programming in FORTRAN is cool, for scientific code often trying to solve a specific physics problem don't usually need those tools like recursion or virtual functions
When everything is "monomorphized" -- you have big arrays of fixed-value-types you can know everything about the world and really optimize everything based off of it -- fun mode to be in for scientific computing code

00:16:34 Considerations beyond recursion and indirect calls?

Some languages use the stack for fast thread switching? Things like full stackful coroutines?
Stacks in Go for example are not contiguous: more like C++ deque: linked list of lists instead of one contiguous stack -- clever x86 code sequence that makes it fast to find previous and next frame
Allows Go stacks to be distinct allocations -- each page-wise is one frame and the next function has another frame -- can put multiple functions in one allocation
Used to have really bad perf if you were in a hot loop and happened to straddle that boundary
Coroutines in some languages ended up having some "stackless" stuff like this, where the closure is heap allocated instead
C++ coroutines try to do away with all the heap allocations, but depends on optimization level whether it can do that or not
Kind of similar for Objective-C blocks -- until recently always heap allocated, started being stack allocated in last few years where they could
Language doesn't say whether stuff lives on the heap or not
Because stack is less constrained can live in different places, e.g. in Go
In some cases you remove the allocation entirely

00:18:25 Scaling to millions of threads?

If you want to be able to scale your concurrency assumptions to millions of threads, you don't want to have huge stacks
Each thread has a stack, and if you have millions of threads you don't want to be allocating too much
And need to be able to switch between those threads quickly
So raises the question: how do you usually size those stacks in the per-thread context you have?
If you're doing tiny little operations; e.g. if every operation in your program was conceptually a thread, you wouldn't want to allocate 512KiB every time you did a tiny atomic operation

00:19:10 Managed languages putting frames on the heap

On the term "stackless": one of the Python "greenlet" ("lightweight thread" terminology) attempts was called Stackless Python
In managed languages like Python the frames can be allocated on the heap; e.g. in CPython the frames are allocated on the heap
But you still have a native stack for your native program that's doing the managed VM execution! (e.g. the interpreter code written in C)
Something that JITs can do is unify the stacks; so that the "interpreted" code uses the same stack space as the virtual machine -- single native thread of stack space
[ed: on deoptimization the JIT code can be reified into a managed stack frame on the heap through a process called "On Stack Invalidation"]
Frame is really just some space, spaces get linked together in the abstract concept of a stack, both of those things can also happen on the heap -- no reason this one virtual memory region that we call the stack is the place where that has to happen

00:20:20 Stack sizes in practice

GPUs have thousands and thousands of threads (organized in warps), you end up wanting to not want huge stacks
But typically on CPU between 512KiB and 2MiB in common environments (MacOS, Linux, Windows)
Allocated at thread startup
Most platforms have API to change the size -- some can only do at runtime, some can only do at startup
Being able to change the stack size is this "uneven" thing across platforms
That's common "userspace": if you look at kernel threads or embedded systems it may be way smaller, XNU: 16KiB
Blowing the stack can be easy in these cases when you're handling interrupts -- can end up blowing stack pretty easily so have to be careful

00:21:27 What are the biggest stack programs?

The reverse of "tiny stacks" environments is interesting -- what are the biggest stack programs that exist? Who does that?
In some environments that set stack limit quite low, when you compile in debug mode, you'll be spilling a bunch of things, and then you'll get a random segfault because the frame got bigger than the initial stack allocation was assuming or you hopped into the guard pages

00:22:08 Allocating default start size

Conceptually, doesn't all have to be allocated at the start do you?
Could grow on the fly as you determine that you need more space
Interesting mechanisms to do that at OS level and for managed languages you could even locate where your stack segment is
Just allocate a page and virtually map the other ones (no physical backing), fault when they get accessed, then map them in as needed
In managed languages when you know where all the pointers are you can relocate all the pointers that had anything to do with the stack locations

00:22:58 Stacks working nicely with LRU caches

Younger frames are recently touched
Spatial and temporal locality are two things that caches exploit
Spatial: when I touch a piece of data might touch something right next to it
Temporal: when I touch a piece of data might touch it again quickly (close in number of cycles / time)
Last-In-First-Out (LIFO) nature of stack and the predictable access pattern -- we go down in addresses and pop back up -- lines up with CPU features like LRU caches and address predictors trying to prefetch things you're going to be pulling in
Function's frame is usually in the cache in memory as "recently used"

00:23:54 Broader question: how does memory in a modern computer work at all?

Talking to stick of DRAM there are values going over the wires, those are the physical addresses
What the computer program sees is the notion of virtual addresses
Virtual memory introduced originally so we could share the physical memory on a system more easily in shared environments
Also offer facilities like protection, etc.
Virtual memory really means "I have some address" -- modern system [userspace] has effectively 47 bit addresses -- that value is really an "alias" -- it that turns into a real physical location
The alias is potentially unique to each program -- two different programs hypothetically could have different 47-bit values that the mapping could turn into the same physical location [ed: this property is the bane of virtually tagged caches]
Each one of these memory access instructions like a load or store needs to translate 47 bit virtual address into a corresponding physical address
You very likely have less than 47 bits of memory in your system (128PiB!) -- maybe you only have 4GiB (32b) of physical memory
Can pretend the physical space is larger and do tricks like page things out to disk, take things that would be backed by physical memory and store them on disk: magic possibilities of the virtual memory subsystem!

00:25:55 TLB Hit!

Each one of these accesses where I take a virtual address and turn it into a physical address
Can be expensive because OS has to maintain/walk a big table of how that translation happens
But there's a translation cache in the hardware called a TLB (Translation Lookaside Buffer)
Because the stack is accessed so frequently and is small in terms of number of pages, and pages are frequently used -- each stack access will usually hit inside of this translation cache
TLB hit! As opposed to a miss
Caches are smort
Should do an episode that talks about virtual memory subsystems in more detail!
Stack is LRU-cache amenable structure, and other benefits to it as well
Long winded way of describing how we ultimately arrive at TLB hits

00:27:00 Store-to-load forwarding of stack values

Other advantages to stack too -- registers are really fast, caches are fast, and go to main memory tier at some point
Modern CPUs have store-to-load forwarding -- "cache" that forwards values from recent stores to loads without having to ask the cache
Store to load forwarding effectively an "L0" cache -- stack can often hit in there -- many stack accesses can hit in this, acting as an L0 cache
[ed: alignment is another important property for being able to hit in the store to load forwarding pipeline!]

00:27:57 Multiprocessing teaser!

Let's say I'm a CPU in a multi-core die and I do a store, and I do store-to-load forwarding, but some other processor stored to the same location -- shouldn't I have seen the store that the other processor did?
"Stuck" in "L0 cache" -- hasn't been written to cache yet -- will it snoop? Coherency is working at the cache level.
How does it work, how can you think about it. Memory models! Once you're in multiprocessing world store-to-load forwarding can be even more interesting than a common optimization.
What properties do you want from your system? Are these single thread improvements "great until they're not?!"

00:29:00 Sparsity of stack slot usage

Each stage is contiguous
How a function uses the stack can be super sparse!
Compiler allocates a slot for each named value in your program
"Coloring" problem -- some parts of the function might have disjoint use of slots
Similar to register allocation -- set of resources is stack memory
Figures out which ones are mutually exclusion like register allocation
In if/else blocks there's mutual exclusion -- even if you reuse slots in that way you end up with parts of the stack that are unused
You could split paths? But probably a lot more overhead for the compiler to track the paths / stack implications and know how to roll them back
Do compilers ever split to optimize stack behavior?
Compiler will do partial outlining to make cold code less intrusive, force any required spilling on that cold path -- maybe that reduces required stack space to some degree?
Probably they need to unify their understanding of the stack at any given program point

00:31:50 Where does the stack go?

Do addresses grow up or grow down?
If A calls B is A's frame addresses bigger than B's frame addresses?
These days it's usually down: earlier in the call stack has higher address for its frame than its callees
Architecture like ARM is both bi-endian and bi-directional in its stack growth
Nothing to do with the abstract notion of push/pop.
Direction is interesting in general -- x86 has a direction register bit!
Changes the direction the addresses iterate for REP (repeat) instructions.
There are these various convention(s) we adhere to based on ABI.
Different implementations do different things when they implement the ABI.

00:33:07 Seemingly arbitrary choices in computer systems

Always fun to see these -- little vs big endian, address map with 0 at top or bottom?
Endianness from Gulliver's Travels: Lilliputians crack and egg on one side or the other (big or little side) that led to holy war.
We have a "war" over whether the low bytes of a multi byte word go at the low memory or high memory [byte] address, we similarly fight these wars over interesting small issues [ed: tabs vs spaces, anybody?].
If you only had the ability to access a word, you wouldn't know what the endianness of a machine is! If you can't observe the byte storage. Would need to fix alignment as well -- or only have word addresses. Word.

00:35:05 Unwinding

Let's say you want to grab a backtrace from your program because it crashed or there's a bug or want to know where it is for diagnostic purposes, or in C++ you throw an exception and need to unwind it, any exception guards to trigger? etc
Even just programming normal C you might want to capture a backtrace to emit a good message on segfault or something, send crash diagnostics back to home base in the case of browsers or something
[ed: or sampling profiling!]
Conceptually just want to walk the linked-list-like structure described earlier
Frame pointer to find older frame, look next to it to find the return address
Linkage between frames and code that caused the frame to be on the stack (via return address)
Lots of cool libraries people have developed to do this: libunwind, libunwind-llvm, libunwind-gcc, crash reporting libraries that wrap up unwinding facilities e.g. breakpad -- what code was executing when something bad like a crash happened?
Ability to walk stack structure and ask questions about it is key!

00:36:40 Walking with frame pointer omission?

If you think about a basic unwinder, all you have to do is walk the stack itself -- tiptoe through the linked list.
But what if you elided frame pointers.
Side data structure that tells you how to determine the frame size for each function.
Compiler knows what it is, so can emit "side data" in the program.
Unwinder needs to look at current program counter -- for callers that gets pushed onto the stack (for the call), callee pops it into the program counter to go back to where it was
Program counter can be used to look up the frame size data
Read only information that the compiler generates keyed off of program counter
Less info to carry on the stack, and only need to look up those side data structures if you're trying to do unwinding
Goes into binary section like DWARF info if you're using ELF binaries, or PCOFF or what have you on Windows
If you're doing unwinding have to call RAII destructors or finally blocks in languages like Java
In other cases may want to do setjmp/longjmp with landing pads -- Win32 did that on x86 to implement exception handling
As you unwind the stack you somehow have to figure out where your landing pads are because those have the code that do the cleanup
For exceptions you have to run code: destructors, catch blocks, etc
Unwind one, look up landing pad, do some work, look up the next landing pad
If you have an exception that happens during the "work", have to deal with that
On Windows it mixes Structured Exception Handling (SEH) with C++
If you have faults or divide by zero floating point you can handle it with SEH
In Windows C++ code you can catch unrelated exceptions from SEH which can be surprising
So with SEH you can write a trap handler in a C program -- might use signals for in other environments like POSIX

00:40:35 What are setjmp/longjmp?

Way to save the current state of registers and where you are on the stack and such
setjmp buffer -- saves it, longjmp you give the buffer and it puts the state back the way it was before
setjmp is a function that can return "twice" (a la clone()) -- first time it saves the state, when you longjmp back it tells you that the state has been restored -- a function that can return twice!
Local state inside of the CPU, sometimes called a microcontext? Registers in it and such to restore to make the local processor state look the way it was
Heap can't go back, but local processor state can go back to where it was
Similar to what we do when we enter kernel space in a way, stash a bunch of state away so the kernel can start doing its thing, then you can resume
Abstract notion of continuations we may have to talk about at some future point

00:41:53 Variable sized stack entities!

We've been talking about frames being of a set size, but there are these functions in C and (kind of) C++, called alloca, and you also have variable length arrays (VLAs)
N bytes is the size, and declare it goes on the stack
Interesting topics: how are they optimized, what's their behavior when it's dynamic, and such!
If you look at standard C++ there's no variable length arrays, common extension to support C VLAs in C++ but not officially a thing that exists
Neat to be able to change the size of the stack frame at runtime
Small stack buffer optimizations using alloca; e.g. "only up to 256 bytes of stuff, otherwise I'll heap allocate"
alloca is kind of an extension to the language in a way
alloca/malloc and then corresponding nothing/free before it exits the scope
alloca(0) technically undefined behavior, as is VLAs of 0
Can try to optimized based off of that undefined behavior, but it breaks things!
What about VLA of negative value? Can't un-grow the stack!
Alloca/VLAs don't fail!
Is it pronounced alloca or "alloh-ka"? We're not sure, but we do know JF is French-Canadian. Alloc, eh?
Returns a pointer to the stack location, can it return null? Would have to be slower if it nominally could fail.
Part of language or OpenGroup POSIX standard? C Programming Language? We're not sure, maybe part of POSIX.
Within LLVM everything is represented as alloca and creates stack out of consolidated allocas, and put things into registers that it can.

00:45:31 Stack buffer overruns

We put lots of things on the stack, we see some are variable sized
Sometimes we put fixed size array on the stack and think we're writing into the right way
But O.G. buffer overrun exploit was to overrun an array that somebody put on a stack, then attacker gets you to clobber the return address -- write an address into there, and maybe jump to the data I had written if the stack were executable. Nowadays the stack is never executable.
Pages in virtual memory have an "execute" bit on them that says whether you can fetch instructions into the CPU out of that virtual memory location.
Fairly new? CPUs 20 years ago couldn't do that
Fascinating all the security/mitigations added recently in computer history.
Originally learned about the buffer overrun exploit in school, and nominally not possible to happen anymore, especially because compilers had facilities for inserting protector guards and things
Exploit shows how everything mixes together in the stack: instruction pointers, data you want to home there.
In a memory unsafe environment an attacker can potentially take advantage of the fact that these all mix together, things like virtual table pointers and such may be in there.
Lots of extra technologies came in later not just from CPU side but from compiler trying to protect stuff.
Stack protector: when you create a large frame (say 8KiB stack array) -- statically (when frame is constructed) or dynamically (as the code executes) will touch all the pages that the code might access. If you went to go access the end right away, might run afoul of the guard page (e.g. if it's a 4KiB single guard page).

00:48:56 What are guard pages?

When the OS creates a thread it'll map a certain amount of stack
It could do that lazily or eagerly
Imagine it eagerly allocates 16KiB, can put a 4KiB "guard page" at the end, marked as not readable or writable
Intent is as you naturally run out of stack you'll fault, more will be provided with physical backing, and execution will be resumed
Have been exploits in the past where programs just "jump over" accessing the guard page and the exploit uses those accesses to do malicious things -- past that guard page is other memory, either another stack, or a heap, etc.
Stack is just another part of the memory space!
Struct offset can be any value -- can have 4GiB of data in an inline array!
In the conventional sense where stack grows down and heap grows up, at some point they might meet! Could jump from one into the other... it's all just locations in memory.
Or if you have multiple stacks maybe stack can collide with another stack.
At the end of the day we have this big flat address space in an abstract sense on modern machines so jumping around inside you have to know the extents.
Brings up topics like how segment registers have evolved on x86: future episode!

00:51:20 The Red Zone

Speaking of x86 evolution! Sounds cool... Red Zone.
Idea you can guarantee some space underneath your frame; e.g. 128B on x86-64,
As leaf function, you know there's extra space that's guaranteed to be there, avoid accounting and clobber stuff relative to frame pointer without having to do accounting.
Part of ABI that that extra space exists there.

00:52:00 Alignment!

Usually you always try to have an aligned stack. Byte aligned stack is weird when most of the point is to spill registers and registers are 4B or 8B.
When ARMv8 was being designed all stack specific instructions (manipulating, aligning) had to fault if the stack itself wasn't aligned properly
Expected any stack manipulation spills at least two 64-bit registers -- invariant: aligned to 16B all the time!
Instructions not just specialized to the stack, also trying to find incorrect usage.
More you know about how your system should be working, the easier it would be to detect if something is trying to be exploited or going off the rails potentially.
Invariants about how system should be operating get stronger.
Prevents people from doing
super-creative-if-somewhat-dangerous-in-the-general-case stuff like "futurist programming" idea: self modifying code tight inner loops to get all the efficiency of a modern machine.
Just to say: some things can become more difficult as we start make this general process around how things are supposed to work.

00:53:45 Processes vs performance

Establishing processes and conventions for "how a machine is used" vs performance
Classic considerations around what registers are saved by caller vs those saved by the callee, can have a big impact on performance
On architectures that by convention had purposes on different machines could make these considerations even more interesting; mul using ecx, rep using edi
Questions of "what do we save away when switching from userspace to kernelspace" -- some set of resources that's not too small but also not overly large
Kernel threads having full separate stack once they leave interrupts to go to full kernel thread
Big space of conventions around performance and the stack that's always fascinating

00:55:45 Managed Virtual Machines: WebAssembly

In WebAssembly all the code is totally separate from the data
WebAssembly meant to be secure VM that can run inside of the browser
Most C/C++ programs use von Neumann machine model where data and code are mixed together
WebAssembly follows Harvard architecture (code and data are kept separate)
Writing C/C++/Rust program in WebAssembly you don't know where any of the code is in an absolute sense
WebAssembly really has two stacks!
Call stack with internal state having function pointers etc.
Whatever state needs to be reified in your web assembly program; e.g. taking address of variable, needs to see the values
Separate stack in the WebAssembly memory space separate from the call stack
If WebAssembly is running in a web browser it's interleaved with JS
Within WebAssembly memory there's a separate stack for your program's execution -- part of the ABI
Extra neat: if you look at the virtual ISA it doesn't have a finite set of registers!
Functions can have infinite registers, can return infinite set of values
"Virtual register" and "variadic return value arity"
Functions have fixed number of inputs
How does it do printf/va_lists?
Compiler creates a stack in user accessible memory (the WebAssembly stack) you spill to
printf routine can access those stack spills into WebAssembly memory

00:58:48 Whole stack'o browser stacks

In full browser, there's native stack, JavaScript stack which JIT can unify with the native stack, also WebAssembly stack -- WebAssembly is also being JIT compiled and potentially pushing things onto the native stack again!
Whole ecosystem of things all have interesting and potential unifying treatment of the general notion of stacks
Like the glue that binds program execution together, in a way!
Super integral to how we think about how machines are going to work today

00:59:30 Managed VM's rich stack choices

In managed VMs you can potentially put more invariants on the program being executed (you are managing what the language is doing)
CLR/.NET you have a typed stack [every slot has a known fixed type]
Projects like Pydgin mapped Python onto the typed stack, but CPython has both data and control stack per frame, so figuring out how to mux Python execution onto the CLR notion of stackiness in this managed world creates all these interesting ideas/interactions
Around what a stack should be, how it should operate, what invariants we know about it; e.g. whether the type of a given slot is always the same, or is it a big bag of bytes
With typed arrays; are the typed arrays words or are they bytes, etc. Lots of variations on how we might think about them.

01:00:40 What happens at the end of the stack?

In C++ when you stack overflow it's not defined where stack end is or what happens when you run out of stack
Realistically what happens is you'll trap because of guard page in most implementations
In JavaScript not defined when you stack overflow
Defined that if you stack overflow it a JavaScript exception is thrown, and you can catch it!
Neat as a programmer, though it is kind of hard to handle
As a VM implementer it's kind of painful -- when you JIT code you're careful about the JIT-ed code not running out of stack similar to stack protector
But runtime functions that are supposed to support the JIT use stack space, and could violate the stack extent!
Now need to throw into the JavaScript code from a out-of-stack condition inside native code
Can be a source of bugs in JS VMs -- JS fuzzers can do a cute "recurse, call helper, unrecurse, call helper" to mess with VM to fuzz those runtime functions interacting with stack overflow
Working on JS engine in FireFox you turn on the fuzzer when some feature is ready and you get the deluge of interesting things that are corner cases in the universe
Really good fuzzers out there for JavaScript: resulting code is evil, but great stuff, that's what you want
Reducers are also the stars of the show -- getting the crisp short program that shows some key problem is always amazing to see

01:03:48 Stack limit!

More things we haven't talked about: rotating register windows, return predictors
[ed: ROP gadgets!]
Infinite stuff to talk about: unlike the stack!
Write to us on Twitter for feedback!

Terminology Glossary

VM / Virtual Machine: an overloaded term; however, most of the time that we use it in the podcast we're referring to the notion that a managed language implementation runs on an abstract machine that be thought of as designed for that language in particular -- e.g. Python has a set of stack-based bytecodes and associated runtime support that conceptually implements the abstract Python virtual machine; similarly for the Java Virtual Machine and so on.
JIT: Just-in-Time compiler -- these are often used for speeding up managed language environments by compiling managed programs to native code at runtime
"Stackless": "Stackless Python, or Stackless, is a Python programming language interpreter, so named because it avoids depending on the C call stack for its own stack." -- [https://en.wikipedia.org/wiki/Stackless_Python]
reify: to "make real", i.e. materialize somehow in the computer or program; antonym: erasure: erasing information from being present in the program that had been determined in some prior phase like compilation
translation unit: the input that is processed by a single compiler invocation (akin to a "module" concept in languages with modules), serves as a natural scope for compiler optimizations

Episode 0: mov fp, sp

00:00:00 Intro

00:00:59 Disclaimer

00:01:42 What's the stack?

00:03:50 Mechanisms in the processor

00:05:10 Comparison to an abstract stack machine

00:06:05 Some instruction set considerations

00:07:09 Compiler optimizations and stackiness

00:09:17 Eliding heap allocations in C++

00:10:18 Frame pointer omission

00:11:49 "Leaf" functions

00:13:10 Why do we have a stack again?

00:16:34 Considerations beyond recursion and indirect calls?

00:18:25 Scaling to millions of threads?

00:19:10 Managed languages putting frames on the heap

00:20:20 Stack sizes in practice

00:21:27 What are the biggest stack programs?

00:22:08 Allocating default start size

00:22:58 Stacks working nicely with LRU caches

00:23:54 Broader question: how does memory in a modern computer work at all?

00:25:55 TLB Hit!

00:27:00 Store-to-load forwarding of stack values

00:27:57 Multiprocessing teaser!

00:29:00 Sparsity of stack slot usage

00:31:50 Where does the stack go?

00:33:07 Seemingly arbitrary choices in computer systems

00:35:05 Unwinding

00:36:40 Walking with frame pointer omission?

00:40:35 What are setjmp/longjmp?

00:41:53 Variable sized stack entities!

00:45:31 Stack buffer overruns

00:48:56 What are guard pages?

00:51:20 The Red Zone

00:52:00 Alignment!

00:53:45 Processes vs performance

00:55:45 Managed Virtual Machines: WebAssembly

00:58:48 Whole stack'o browser stacks

00:59:30 Managed VM's rich stack choices

01:00:40 What happens at the end of the stack?

01:03:48 Stack limit!

Terminology Glossary

Episode 0: `mov fp, sp`