Assembly segmentation fault during retq - 【StackMirror】|assembly|segmentation-fault|stack|x86-64

I have a bit of assembly code that calls another using callq. Upon calling retq, the program crashes with a segmentation fault.

    .globl  main
main:                   # def main():
    pushq   %rbp        #
    movq    %rsp, %rbp  #

    callq   input       # get input
    movq    %rax, %r8

    callq   r8_digits_to_stack
    # program is not getting here before the segmentation fault
    jmp     exit_0

# put the binary digits of r8 on the stack, last digit first (lowest)
# uses: rcx, rbx
r8_digits_to_stack:
    movq    %r8, %rax       # copy for popping digits off

    loop_digits_to_stack:
        cmpq    $0, %rax    # if our copy is zero, we're done!
        jle     return

        movq    %rax, %rcx  # make another copy to extract digit with
        andq    $1, %rcx    # get last digit
        pushq   %rcx        # push last digit to stack
        sarq    %rax        # knock off last digit for next loop
        jmp     loop_digits_to_stack

# return from wherever we were last called
return:
    retq

# exit with code 0
exit_0:
    movq    $0, %rax    # return 0
    popq    %rbp
    retq

Where input is a C function that returns keyboard input to %rax.

I assume that this might have something to do with the fact that I'm manipulating the stack, iis that the case?

2015-09-30 07:55
by Langston

Thanks for the link, I'll update soon - Langston 2015-09-30 08:33

deleted my old comment, you did actually say where your code faulted - Peter Cordes 2015-09-30 08:38

I think one of your return paths doesn't pop rbp. Just leave out the

pushq   %rbp
movq    %rsp, %rbp

pop     %rbp

altogether. gcc's default is -fomit-frame-pointer.

Or fix your non-return-zero path to also pop rbp.

Actually, you're screwed because your function appears to be designed to put stuff on the stack an never take it off. If you want to invent your own ABI where space below the stack pointer can be used to return arrays, that's interesting, but you'll have to keep track of how big they are so you can adjust rsp back to pointing at the return address before ret.

I recommend against loading the return address into a register and replacing a later ret with jmp *%rdx or something. That would throw off the call/return address prediction logic in modern CPUs, and cause a stall the same as a branch mispredict. (See http://agner.org/optimize/). CPUs hate mismatched call/ret. I can't find a specific page to link about that right now.

See https://stackoverflow.com/tags/x86/info for other useful resources, including ABI documentation on how functions normally take args.

You could copy the return address down to below the array you just pushed, and then run ret, to return with %rsp modified. But unless you need to call a long function from multiple call sites, it's probably better to just inline it in one or two call-sites.

If it's too big to inline at too many call sites, your best bet, instead of using call, and copying the return address down to the new location, would be to emulate call and ret. Caller does

    put args in some registers
    lea   .ret_location(%rip), %rbx
    jmp   my_weird_helper_function
.ret_location:  # in NASM/YASM, labels starting with . are local labels, and don't show up in the object file.
         # GNU assembler might only treat symbols starting with .L that way.
    ...


my_weird_helper_function:
    use args, potentially modifying the stack
    jmp *%rbx   # return

You need a really good reason to use something like this. And you'll have to justify / explain it with a lot of comments, because it's not what readers will be expecting. First of all, what are you going to do with this array that you pushed onto the stack? Are you going to find its length by subtracting rsp and rbp or something?

Interestingly, even though push has to modify rsp as well as do a store, it has one per clock throughput on all recent CPUs. Intel CPUs have a stack engine to make stack ops not have to wait for rsp to be computed in the out-of-order engine when it's only been changed by push/pop/call/ret. (mixing push/pop with mov 4(%rsp), %rax or whatever results in extra uops being inserted to sync the OOO-engine's rsp with the stack-engine's offset.) Intel/AMD CPUs can only do one store per clock anyway, but Intel SnB and later can pop twice per clock.

So push/pop is actually not a terrible way to implement a stack data structure, esp. on Intel.

Also, your code is structured weirdly. main() is split across r8_digits_to_stack. That's fine, but you're not taking advantage of falling through from one block into the other block ever, so it just costs you an extra jmp in main for no benefit and a huge readability downside.

Let's pretend your loop is part of main, since I already talked about how it's super-weird to have a function return with %rsp modified.

Your loop could be simpler, too. Structure things with a jcc back to the top, when possible.

There's a small benefit to avoiding the upper 16 registers: 32bit insns with the classic registers doesn't need a REX prefix byte. So lets pretend we just have our starting value in %rax.

digits_to_stack:
# put each bit of %rax into its own 8 byte element on the stack for maximum space-inefficiency

    movq   %rax, %rdx  # save a copy

    xor    %ecx, %ecx  # setcc is only available for byte operands, so zero %rcx

    # need a test at the top after transforming while() into do{}while
    test   %rax, %rax  # fewer insn bytes to test for zero this way
    jz  .Lend

    # Another option can be to jmp to the test at the end of the loop, to begin the first iteration there.

.align 16
.Lpush_loop:
    shr   $1, %rax   # shift the low bit into CF, set ZF based on the result
    setc  %cl       # set %cl to 0 or 1, based on the carry flag
    # movzbl %cl, %ecx  # zero-extend
    pushq %rcx
      #.Lfirst_iter_entry
      # test %rax, %rax   # not needed, flags still set from shr
    jnz  .Lpush_loop
.Lend:

This version still kinda sucks, because on Intel P6 / SnB CPU families, using a wider register after writing a smaller part leads to a slowdown. (stall on pre-SnB, or extra uop on SnB and later). Others, including AMD and Silvermont, don't track partial-registers separately, so writing to %cl has a dependency on the previous value of %rcx. (writing to a 32bit reg zeros the upper 32, which avoids the partial-reg dependency problem.) A movzx to zero-extend from byte to long would do what Sandybridge does implicitly, and give a speedup on older CPUs.

This won't quite run in a single cycle per iteration on Intel, but might on AMD. mov/and $1 is not bad, but and affects the flags, making it harder to loop just based on shr setting the flags.

Note that your old version sarq %rax shifts in sign bits, not necessarily zeros, so with negative input your old version would be an inf loop (and segfault when you ran out of stack space (push would try to write to an unmapped page)).

2015-09-30 08:18
by Peter Cordes

Oh, I'm trying to use return to return from the call to r8_digits_to_stack and exit_0 to exit the program. Is that not the correct way to do that - Langston 2015-09-30 08:31

Is it not possible to have a function put stuff on the stack and have another function use it later? Sorry for my lack of knowledge, thanks for your help - Langston 2015-09-30 08:34

@Langston: ok, I was only skimming your code. Yeah, use ret to exit main (or make an exit system call if you're writing stand-alone asm code that doesn't use the C startup files. (gcc -nostartfiles, or just as && ld)). But see the update to my answer about how you're using the stack. I think when you try to ret at the end of r8_digits_to_stack, the stack pointer isn't pointing at the return address. So no, it's not normal to do that - Peter Cordes 2015-09-30 08:36

@Langston: updated my answer with some more code-review and an attempt at optimizing the loop. My version will do poorly on Intel pre-Sandybridge, but saves several instructions - Peter Cordes 2015-09-30 09:49

Awesome, thank you for your continued help! This makes things a lot clearer - Langston 2015-09-30 17:21

@Langston: cheers, glad I could help. At first I was just thinking "this is crazy and can't possibly work", and then I had an idea and invented a not-terrible way to actually do it, changing it to "crazy but works", so that was fun - Peter Cordes 2015-09-30 17:38