writing assembly by hand

September 29, 2021

motivation for learning assembly

Pursuing a deeper understanding of what the machine actually does, I’m writing a bunch of assembly by hand. This week we’ve worked through the basics of asm at Bradfield. This post is meant to organise my knowledge and major learnings.

It isn’t pleasant, but it is fun.

I’m driven by wanting to understand computers, and I honestly believe that understanding the execution of a program from first principles is not only important, but absolutely required for any kind of high-impact work. I won’t lie to you, the experience was quite shocking, but with the shock came enlightenment. Nothing worth doing is done easy.

When I program in higher-level languages, I don’t usually think low level, and being shielded from all the details of how the machine operates eventually started creating discomfort for me. There’s no realistic need for most of my work to be remotely near assembly code, but there’s so much that is shrouded in the mystery of the compiler and of whichever VM my JS or Python code is running on.

So why would I learn to write machine code? Realistically I won’t ever write a line of production assembly unless I sell my soul to the devil for a handful of peanuts, but reading disassembly and debugging using lldb or gdb has always been difficult and cumbersome for me.

It’s not about being able to write programs in assembly, but being able to read and understand the code that is generated by the compilers.

  • Experience raw machine instruction sequences.
  • Compare what I think my code will do with what compilers actually spit out.
  • Get better at debugging using lldb and gdb.
  • Be able to say to my friends who study electrical engineering that I also ‘wrote assembly’. It upsets them, to my amusement.


First of all, getting this knowledge isn’t as easy as one may be used to when searching for information on <programming language> du jour. Here are the sources I’ve used as references:

If you only look at one book from this list, please consider CS:APP. It is a very involved text (expect to take weeks to go through it, especially if you work through the exercises), but it’s well worth it.

things I’ve learned

compilers are really clever

Try going to godbolt and inputting even something as simple as this C snippet.

int sum_to(int n) {
  int total_sum = 0;
  for (int i = 0; i < n; i++) {
    total_sum += i;
  }
  return total_sum;
}

int main() {
    return 0;
}
Take note of the different optimization levels (pass the -O{0,1,2} compiler option, with a higher number meaning more optimized), and watch what happens. Using x86_64 clang 3.1, I noticed that the higher optimization levels not only, duh, produce more succinct and clever instructions, but also skip the loop completely, having recognized an arithmetic sequence.

The downside is that higher optimization levels generate code that is much further away from the structure of the code you’ve written. Your original C code gets heavily transformed, and in most cases the direct mapping between your incantations and the output machine code is lost completely.

That doesn’t mean, of course, that you shouldn’t be slapping those -O2’s on your clang invocations. It just means that if you wish to track program execution more closely for the purpose of learning assembly, like me here, you may want to compile without optimizations to begin with. Or use godbolt to get help with the mapping.
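To make the loop-skipping concrete, here is a minimal sketch in C of the kind of rewrite an optimizer can derive. It assumes the loop adds up 0 through n - 1. The names sum_to_loop and sum_to_closed are made up for illustration, and the actual instructions a given clang version emits will look different.

```c
/* The loop as written: adds 0, 1, ..., n - 1. */
int sum_to_loop(int n) {
  int total_sum = 0;
  for (int i = 0; i < n; i++) {
    total_sum += i;
  }
  return total_sum;
}

/* Hypothetical hand-written equivalent: the closed form of the same
   arithmetic series, n * (n - 1) / 2, with no loop at all. */
int sum_to_closed(int n) {
  if (n <= 0) {
    return 0; /* the loop body never runs for n <= 0 */
  }
  /* widen before multiplying so the intermediate product cannot overflow */
  return (int)(((long long)n * (n - 1)) / 2);
}
```

Both functions return 45 for n = 10, but the second one takes the same time no matter how large n is.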

execution of instructions out of order

One of the ~lies~, cough, leaky abstractions I’ve acquired while studying this topic is the description of an ISA (instruction set architecture) as if each instruction were executed in sequence, with one completing before the next one starts. A modern processor is way more elaborate and for the most part executes multiple instructions concurrently. Of course, it employs safeguards to ensure that the overall behavior of the program matches the sequential operation described in the ISA.

I’ve mentioned it in my blogpost on simulating the fetch-decode-execute loop. CS:APP describes this issue in great detail, but as usual, wikipedia does a pretty good job of providing an overview.

even tiny programs build into way larger constructs

I’m pulling this example from section 3.2.2 of CS:APP. Given a file containing this function definition:

long mult2(long, long);

void mulstore(long x, long y, long *dest) {
  long t = mult2(x, y);
  *dest = t;
}

That produces this assembly:

  pushq   %rbx
  movq    %rdx, %rbx
  call    mult2
  movq    %rax, (%rbx)
  popq    %rbx
  ret

The total size of the file after compiling and assembling with gcc -Og -c mstore.c will be… 1,368 bytes! Bear in mind that the entire set of instructions that makes mulstore do what it does is only 14 bytes long (53 48 89 d3 e8 00 00 00 00 48 89 03 5b c3). And that’s just the mstore object file!

When we create the entire executable by running gcc -Og -o progr main.c mstore.c, we are gonna be looking at 8,655 bytes, which contain not only the machine code for the procedures we provided, but also the code that starts the execution, terminates it, and communicates with the host operating system.

This gave me a perspective on the size of executables that I’ve been observing (some larger programs really grow in size).

think twice before targeting functionality of a specific ISA or CPU in your code

For compiled languages, if you rely on features of a specific piece of hardware, you may find yourself writing code that gives other CPUs real trouble. One example of this in C is declaring floating-point numbers as long double.

Floating-point numbers commonly come in two formats: single precision, which takes up 4 bytes, and double precision, which takes up 8. These correspond to the C data types float and double, respectively. Microprocessors from the x86 family have historically implemented floating-point operations with a special 10-byte format, which is what long double gives you there. long double is standard C, so the code will still compile elsewhere, but its size and precision are implementation-defined: on another architecture it may be no wider than double, or it may not be backed by the same high-performance hardware as float and double.
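A quick way to see what long double means on a given machine is to ask the compiler. This is a small sketch using only sizeof and LDBL_MANT_DIG from <float.h>. The function names are made up for illustration, and the list of formats it recognizes is not exhaustive.

```c
#include <float.h>
#include <stdio.h>

/* Classify long double by its significand width (LDBL_MANT_DIG). */
const char *long_double_kind(void) {
  switch (LDBL_MANT_DIG) {
    case 53:  return "same precision as double";
    case 64:  return "x87 80-bit extended precision";
    case 113: return "128-bit quad precision";
    default:  return "some other format";
  }
}

/* Print the storage size of each floating-point type on this machine. */
void print_float_sizes(void) {
  printf("float:       %zu bytes\n", sizeof(float));
  printf("double:      %zu bytes\n", sizeof(double));
  printf("long double: %zu bytes (%s)\n",
         sizeof(long double), long_double_kind());
}
```

Note that sizeof reports storage including padding, so on x86-64 Linux the 10-byte format typically shows up as 16 bytes.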

different mov instructions into registers treat remaining bits differently

Accessing integer registers on x86 can be done using different sizes: a single byte, a word (16-bit), a double word (32-bit), and a quad word (64-bit). You can basically ask to read or store only a certain range of bits. If all you need is a byte, then you can totally just do that.

What happens to the remaining bits though? Well, it depends.

There are two different conventions regarding what to do with the upper bytes of the destination register. Look at this example:

movb $-1, %al                       # %rax = DEADBEEFBABABEFF
movw $-1, %ax                       # %rax = DEADBEEFBABAFFFF
movl $-1, %eax                      # %rax = 00000000FFFFFFFF
movq $-1, %rax                      # %rax = FFFFFFFFFFFFFFFF

Note that the hex representation of -1 in two’s complement is FFFFFFFFFFFFFFFF for a quad word.

If you look closely at the values of the register %rax, you’ll see that the higher-order bits are treated differently depending on the size of the move. The b and w variants of mov leave the remaining upper bytes untouched, while moving a double word (long) sets the upper 4 bytes to 0. Take note if you care about the higher-order bits!
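The two conventions can be modeled in plain C with masks. This is only a sketch of the behavior described above, not how the hardware does it, and the write_* helpers are hypothetical names.

```c
#include <stdint.h>

/* movb into %al: replace the low 8 bits, keep the upper 56. */
uint64_t write_byte(uint64_t reg, uint8_t v) {
  return (reg & ~UINT64_C(0xFF)) | v;
}

/* movw into %ax: replace the low 16 bits, keep the upper 48. */
uint64_t write_word(uint64_t reg, uint16_t v) {
  return (reg & ~UINT64_C(0xFFFF)) | v;
}

/* movl into %eax: replace the low 32 bits and zero the upper 32. */
uint64_t write_long(uint64_t reg, uint32_t v) {
  (void)reg; /* nothing of the old value survives */
  return (uint64_t)v;
}

/* movq into %rax: replace all 64 bits. */
uint64_t write_quad(uint64_t reg, uint64_t v) {
  (void)reg;
  return v;
}
```

Starting from a register holding DEADBEEFBABABE00 (the low byte is an assumption), write_byte with 0xFF gives DEADBEEFBABABEFF, while write_long with 0xFFFFFFFF gives 00000000FFFFFFFF regardless of what was there before.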

Things get even messier when it comes to the differences between movb, movsbq, and movzbq. For a discussion of those, I direct you to section 3.3.3 of CS:APP.
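As a rough C analogue of the two extending moves: widening an unsigned byte zero-extends it, and widening a signed byte sign-extends it. The helper names below are made up, and the cast to int8_t assumes the usual two’s-complement wraparound.

```c
#include <stdint.h>

/* movzbq analogue: the upper 56 bits become 0. */
uint64_t zero_extend_byte(uint8_t b) {
  return (uint64_t)b;
}

/* movsbq analogue: the upper 56 bits become copies of bit 7. */
uint64_t sign_extend_byte(uint8_t b) {
  /* reinterpret as a signed byte first (assumes two's-complement wraparound) */
  return (uint64_t)(int64_t)(int8_t)b;
}
```

For the byte FF, the first helper yields 00000000000000FF and the second yields FFFFFFFFFFFFFFFF. A plain movb would instead leave the upper 7 bytes of the destination as they were.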

summing up - compatibility matters

I’m glad I spent the time on this. It’s been a very enjoyable learning experience. One thing I found is that the tooling and software compatibility on macOS isn’t as nice as what’s found on Linux. I have had to resort to dedicated DigitalOcean droplets and Docker images running in the Docker VM on my Mac in order to use gdb and perf, with varying levels of success. For practice I also set up a Raspberry Pi, but since it has an ARM processor, I ended up on a completely different tangent. All the computers I program in the cloud are x86 Linux machines, though. Time for new hardware?

Written by Daniel Kaczmarczyk, a software engineer and educator. You can find me on twitter or email me at daniel.kacz@protonmail.com
