Format strings solve the issue of allowing a string to be output that includes variables formatted precisely as dictated by the programmer. The format must be specified via a placeholder. They are not malicious themselves, they are just a basic mechanism that allows to print variables inside a string in a specific format. This mechanism is possible thanks to special characters called placeholders.
Definition
Placeholders are special characters that specify how the data inside the string should be formatted, the basically specify how to format a variable and its position inside the string (correspondence between placeholder and variable). In addition, they specify the compiler how many parameters to expect to print in the string, there is correspondence between the position of the placeholder and the variable to be printed.
From the memory point of view, placeholders just specify the format in which to interpret a specific memory cell, the CPU is told that a certain cell is an int/char/so on. There are a lot of placeholders (especially in C):
Placeholder
Description
%d / %i
Decimal
%u
Unsigned decimal
%o
Unsigned octal
%X / %x
Unsigned hexadecimal
%c
Character
%s
String (char*), prints until \0
Besides specifying the format, placeholders may be used to read/write in memory. Combining the power of reading and writing, the attacker is being provided with whatever they need to complete the attack and do anything in memory. Format strings mechanisms typically affects printf, fprintf, vfprintf, sprintf, vsprinf, snprintf, vsnprintf. However, this does not mean that all of them are vulnerable to format strings exploitation.
The problem is more complex than buffer overflow, it is not limited to printing functions exclusively.
Warning
A program is vulnerable to format string bugs when a printing function, that theoretically consents format strings, has no format specifications about the format of the input that will be inserted: this is the vulnerability, not print with format strings but the fact that printing functions has no format specs.
This is a problem because in this case the attacker did not insert a real string, but special characters (placeholders directly without the corresponding variable) and the CPU will go somewhere in memory and print memory cells instead of the variable that would be expected to be there. CPU is printing a number of memory cells equal to the number of placeholders that the attackers provides in input, because there is no correspondent variable. If the format of the variable to print is specified the function is not vulnerable, for example:
snprintf(buf, 250, "%i, %i, %i", a, b, c); // non vulnerable snprintf(buf, 250, "%x, %x, %x"); // vulnerable
Example
In this way it is possible to read what is already on the stack: when the format string is parsed, snprintf() expects three more parameters from the caller to replace the %xs. According on the calling convention, these are expected to be pushed on the stack by the caller. Thus, the snprintf() expects them to be on the stack before the preceding arguments. It is also possible to read something that is being put on the stack, such as string in snprintf(buf, 250, "AAAA");.
In this case, when the function is called the address of the variable is pushed on the stack with the other arguments. At that address the format string can be found.
Knowing the address, the purpose is now to enter the memory and find the string: to do this some placeholders should be added, for example snprintf(buffer, sizeof(buffer), "AAAA %x, %x, %x");. At some point the string will be found. Going back on the stack usually part of the format string can be found, so it’s possible to read what is being put on the stack.
The number of placeholders to use in this case depends on the program. There must be a way to automatically find a certain string in the stack: another mechanism related to placeholders that can be used: in this case is %N$x syntax (go to th parameter). is the number of placeholders, so %N$x is the direct parameter access (accesses th element), is the desired placeholder. To find a specific element, it is necessary to combine this mechanism with some shell scripting. For example let’s consider a simple C program that simply print what’s passed as a parameter (argv[1]) when called:
#include <stdio.h>int main (int argc, char* argv[]){ printf(argv[1]); return 0;}
then compile and run it:
gcc -o vuln vuln.c./vuln "ciao"ciao./vuln3 "%x %x %x" # The actual values and number of %x can change depending on machine, compiler, etc.buffer: b7ff0ae0 66663762 30656130
We can use the shell script to execute multiple times the program in order to print more element from the stack:
for i in 'seq 1 150'; do echo -n "$i " && ./vuln "AAAA %$i\$x"; done1 AAAA b7ff05902 AAAA 804849b# ........lots of lines...... # 1 dword from the stack per line 150 AAAA 53555f6e
This scripts will print all the memory cells up to 150 (toward higher addresses). Using the grep utility, it’s possible to filter the output and find the desired cell of the stack.
for i in 'seq 1 150'; do echo -n "$i " && ./vuln "AAAA %$i\$x"; echo ""; done | grep 4141114 AAAB42414141 # there is my cell and I can read from! We had to go 114 positions up.
This is concatenated with the grep command, which allows to search for something: only the strings that matches the 4141 pattern will be printed. After doing this, with
./vuln "AAAA %$114\$x"AAAB42414141 # So, we can effectively read.
the string is effectively read. Let’s now consider:
In this case, the vulnerable function is called within the function test(): more arguments will be put on the stack, since more functions are called, the stack contain more variables.
The function grep can be used to scan the entire stack and look for interesting data in memory (for example, if the name of some delicate variables knowing which authentication can be bypassed is known).
The basic format string vulnerability is that it’s possible to read whatever is on the stack. But what if the attacker can also write? There is a placeholder for that, $n. This placeholder writes in the address pointed to by the argument the number of chars (bytes) printed so far. It writes in memory the number of characters printed so far:
In this program segfault happens at 41414141 because %n wants to write at the address pointed to by the argument. However, there is no variable, the considered address is the variable at position 2, which in this case is “AAAA”. If instead of putting “41414141” the attackers puts a real address and changes the string “AAAA” they will write “4” at that address.
While in buffer overflow the stack is smashed, format strings directly write at a specific address in memory. Any address can be encoded in bits. %n pulls an int* (address) from the stack, goes there and writes the number of chars printed so far. In this example that address is 0x41414141.
This mechanism can be used to put in place attacks by:
putting on the stack the address (addr) of the memory cell (target) to modify (as part of the format string)
using %x to find it on the stack (%N$x)
using %n instead of %x to write a number in the cell pointed to by addr, i.e. target
Python can serve as a way to write an address:
./vuln "AAAA%2$n"./vuln "'python -c 'print "AAAA%2$n"''" # -c prints non-printable characters./vuln "'python -c 'print "\x41\x41\x41\x41%2$n"''" # %n writes '4' at the element at position 2 = 0xbffff6cc./vuln "'python -c 'print "\xcc\xf6\xff\xbf%2$n"''" # this part (element at position 2) of the format string is printed
But how to change the number into an arbitrary (controllable) number? The placeholder %c allows to specify the padding (how many characters to encode). However, in order to avoid writing GB of data, each DWORD of the address (32 bits, up to 4 Gb) is split into 2 WORDs (16 bits, up to 64 kb) and write them in 2 rounds. 4 Gb might not be much for a computer, but they are e.g. for IoT devices, which can be vulnerable.
There is a limitation: the placeholder %c cannot count down: it is a counter of the number of characters printed so far.
For example to print 10 we should print first 3 and then 7 (3+7=10), but not 7 and then 3.
In the first round we should write the number with lower absolute value and then the other one, not the other way round. The counter can not count down (we could overflow). Performing the writing procedure twice means writing in the same format string: the format string is only exploited once, two addresses are written at the same time. We have to put in memory the 2 target address and their displacements first, then we have to do some math to compute the number that corresponds to the one we want to write there.
Addresses are put on the stack as part of the format string, the attack goes as follows:
put the 2 target addresses of the memory cells to modify on the stack (as part of the format string, at the beginning)
use %x to find <target1> on the stack (%N$x), <target2> will be at pos + 1 (one DWORD up: target+2), where pos is target1’s displacement
use %x and %n to write the lower absolute value in the cell pointed by target1, the higher decimal value in the cell pointed by target2
While writing, we might be writing some 00s as well. After pushing the two addresses, since 8b are already on the stack for the target addresses, they should be removed from the number we want to write.
For the second number, we should again calculate the difference between the higher and lower numbers (equals the remaining bytes). If we use %hn instead of %n we can avoid the zeroes problem: tt%hn directly writes 16 bytes. In this case the purpose was to write an address at a certain target address.
The general form of a format string is:
# if the first half is > that the second one<tgt (1st 2 bytes)> # where to write (hex, little endian)<tgt+2 (2nd 2 bytes)> # where to write +2 (hex, little endian)%<low value - printed>c # what to write - #chars printed (dec), all printed chars so far should be considered%<pos>$hn # displacement on the stack (dec)%<high value - low value>c # what to write - what written (dec)%<pos+1>$hn # displacement on the stack +1 (dec)
The main phases of exploiting format strings depend on what (address) we want to write:
If the most significant (left) part is higher than the least significant part, the low value is printed first in the format string
If the most significant part is lower than the least significant part tgt and tgt+2 must be switched: since the lower part must be written first and the lower part is the most significant one it must be written first.
# if the first half is < that the second one<tgt+2 (1st 2 bytes)> # where to write (hex, little endian)<tgt (2nd 2 bytes)> # where to write +2 (hex, little endian)%<low value - printed>c # what to write - #chars printed (dec), all printed chars so far should be considered%<pos>$hn # displacement on the stack (dec)%<high value - low value>c # what to write - what written (dec)%<pos+1>$hn # displacement on the stack +1 (dec)
In theory, instead of splitting tgt and tgt+2 the other two parts can be split as well (it is equivalent). The target address can be the saved return address (EIP, to be found on the stack), global offset table (GOT) or C library hooks, exception handlers, function pointers and other structures. Finding the desired address on the same is the same as with stack overflows.
Example
We want to exploit this vulnerability to overwrite the saved EIP with the address of the environment variable $EGG that contain your executable shellcode. Assuming we know that:
The address of $EGG (i.e., what to write) is 0x44674234 (0x4467(hex) = 17511(dec), 0x4234(hex) = 16948(dec))
The target address of the EIP (i.e., the address where we want to write ) is 0x42414515
The displacement on the stack of your format string is equal to
Write an exploit for the format string vulnerability to execute the shellcode in the environment variable EGG. Write the exploit clearly, detailing all the components of the format string, and detailing all the steps that lead to a successful exploitation.
The two target values are 0x42414515 and 0x42414517, the displacement is . The first thing to do decide which of the two values to write first: converting to decimal, we have 0x4467 -> 17511 and 0x4234 -> 16948, so we should write the second one first. Now, we have to write the lower value first at the address 0x42414515 and the higher value at the address 0x42414515 + 2. The next step is to compute the displacement of the two addresses on the stack: the first address is at position , so the second address is at position . Lastly, we have to compute <lower_val> = 16948 - 8 = 16940 and <higher_val> = 17511 - 16948 = 563. The final format string is:
Note that the addresses are written in little-endian format, so the least significant byte is written first.
Format String Bugs Countermeasures
Random canaries are not effective anymore, instead if address randomization (OS level) is active it is not possible to exploit format strings. Non-executable stack is only effective if we are jumping to some code on the stack (EIP is on the stack as well), instead is some return to libs is overwritten it is a useless countermeasure.
Memory error countermeasures help preventing exploitation in general. Modern compilers show warnings when potentially dangerous calls toprintf-like functions are found. Moreover, patched versions of libc to mitigate the problem exist.
To prevent format strings bugs, always specify the format of the input that will be inserted. This is the main countermeasure. If the format is specified, the function is not vulnerable. Other countermeasures are similar to those for buffer overflow: ROP, ret2libc, ASLR are the main ones. ROP and ret2libc can be used to bypass non-executable stack, ASLR can be used to bypass address randomization.
Affected functions
Conceptually, format strings bugs are not specific to printing functions: in theory any function with a unique combination of characteristics is potentially affected:
a variadic function (variable number of parameters “resolved” at runtime by pulling them from the stack)
a mechanism (placeholders) to (in)directly read/write arbitrary locations