November 24, 2011

Signed Integers Considered Stupid (Like This Title)

Unrelated note: If you title your article "[x] considered harmful", you are a horrible person with no originality. Stop doing it.

Signed integers have always bugged me. I've seen quite a bit of signed integer overuse in C#, but it is most egregious when dealing with C/C++ libraries that, for some reason, insist on using for(int i = 0; i < 5; ++i). Why would you ever write that? i cannot possibly be negative and for that matter shouldn't be negative, ever. Use for(unsigned int i = 0; i < 5; ++i), for crying out loud.

But really, that's not a fair example. You don't really lose anything using an integer for the i value there because its range isn't large enough. The places where this become stupid are things like using an integer for height and width, or returning a signed integer count. Why on earth would you want to return a negative count? If the count fails, return an unsigned -1, which is just the maximum possible value for your chosen unsigned integral type. Of course, certain people seem to think this is a bad idea because then you will return the largest positive number possible. What if they interpret that as a valid count and try to allocate 4 gigs of memory? Well gee, I don't know, what happens when you try to allocate -1 bytes of memory? In both cases, something is going to explode, and in both cases, its because the person using your code is an idiot. Neither way is more safe than the other. In fact, signed integers cause far more problems then they solve.

One of the most painfully obvious issues here is that virtually every single architecture in the world uses the two's complement representation of signed integers. When you are using two's complement on an 8-bit signed integer type (a char in C++), the largest positive value is 127, and the largest negative value is -128. That means a signed integer can represent a negative number so large it cannot be represented as a positive number. What happens when you do (char)abs(-128)? It tries to return 128, which overflows back to... -128. This is the cause of a host of security problems, and what's hilarious is that a lot of people try to use this to fuel their argument that you should use C# or Java or Haskell or some other esoteric language that makes them feel smart. The fact is, any language with fixed size integers has this problem. That means C# has it, Java has it, most languages have it to some degree. This bug doesn't mean you should stop using C++, it means you need to stop using signed integers in places they don't belong. Observe the following code:
  if (*p == '*')
  {
    ++p;
    total_width += abs (va_arg (ap, int));
  }
This is retarded. Why on earth are you interpreting an argument as a signed integer only to then immediately call abs() on it? So a brain damaged programmer can throw in negative values and not blow things up? If it can only possibly be valid when it is a positive number, interpret it as a unsigned int. Even if someone tries putting in a negative number, they will serve only to make the total_width abnormally large, instead of potentially putting in -128, causing abs() to return -128 and creating a total_width that is far too small, causing a buffer overflow and hacking into your program. And don't go declaring total_width as a signed integer either, because that's just stupid. Using an unsigned integer here closes a potential security hole and makes it even harder for a dumb programmer to screw things up1.

I can only attribute the vast overuse of int to programmer laziness. unsigned int is just too long to write. Of course, that's what typedef's are for, so that isn't an excuse, so maybe they're worried a programmer won't understand how to put a -1 into an unsigned int? Even if they didn't, you could still cast the int to an unsigned int to serve the same purpose and close the security hole. I am simply at a loss as to why I see int's all over code that could never possibly be negative. If it could never possibly be negative, you are therefore assuming that it won't be negative, so it's a much better idea to just make it impossible for it to be negative instead of giving hackers 200 possible ways to break your program.


1 There's actually another error here in that total_width can overflow even when unsigned, and there is no check for that, but that's beyond the scope of this article.

13 comments:

  1. Yeah, I always use "unsigned" despite the extra space.

    http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html

    ReplyDelete
  2. Actually its the other way around; unsigned integers really complicate things, introducing subtle bugs and have been responsible for a whole raft of real security vulnerabilities over the years. I remember at Symbian we recommended against them and I presume its related to why Java went to far as to banish them.

    ReplyDelete
  3. I call bullshit on that without examples of unsigned integers actually complicating things in ways not related to retarded programmers. If it is true, then it is not the other way around - it is simply integers in general causing problems, period, as the above example demonstrates.

    ReplyDelete
  4. Fair enough.

    VU#551436 CVE-2009-1385 CVE-2008-0600 etc

    ReplyDelete
  5. Signed and unsigned integers don't mix; when used in an expression, the signed integer is treated as unsigned.

    Its almost impossible to find some code that isn't littered with legitimate signed integers. They are really hard to avoid.

    Whereas the only place that unsigned creeps in is sizeof and size_t. And even then ptrdiff_t and so on are signed.

    As a consequence, its a real shame that sizeof and size_t etc weren't signed and unsigned were banished.

    If you have to pick one, try living without signed. On the other hand, you can easily live without unsigned.

    In Java bitwise arithmetic works fine despite normal maths being signed. And there's no-one struggling to write anything for lack of unsigned integers.

    ReplyDelete
  6. I'm not saying we should stop using signed integers. I just built a rational class that uses signed ints all over the place, including calling abs() on them. I'm saying there are signed integers being used in places they do not belong, and its stupid, and it causes problems. Moving everything to signed does not solve the problems, because you have issues with abs() and two's complement to deal with. So either your dealing with potential negative numbers that are impossible to make positive, or you are dealing with unsigned/signed mismatches. Banishing unsigned integers only serves to force the programmer to deal with the problem one way. This doesn't even get into issues with pointers and the fact that they technically have an unsigned integral range which can cause serious problems when using pointer arithmetic.

    So either you can try to use only signed integers and deal with all the problems that introduces, or you can use unsigned integers where possible and deal with potential unsigned/signed mismatches. The real thing I'm complaining about here are places people use signed integers when it serves only to cause problems. Ideally, one uses unsigned integers where using a signed integer introduces problems, and then use a signed integer where an unsigned integer causes problems. This insistence on using only signed integers is exactly what I am complaining about - both should be used where they belong.

    ReplyDelete
  7. Google say it way better than I. Look at their style-guide http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml#Integer_Types

    "Some people, including some textbook authors, recommend using unsigned types to represent numbers that are never negative. This is intended as a form of self-documentation. However, in C, the advantages of such documentation are outweighed by the real bugs it can introduce. Consider:

    for (unsigned int i = foo.Length()-1; i >= 0; --i) ...

    This code will never terminate! Sometimes gcc will notice this bug and warn you, but often it will not. Equally bad bugs can occur when comparing signed and unsigned variables. Basically, C's type-promotion scheme causes unsigned types to behave differently than one might expect.

    So, document that a variable is non-negative using assertions. Don't use an unsigned type."

    ReplyDelete
  8. I like a lot of what Google does, but not that. If you try to use an unsigned type and do not understand the implications of overflowing, you won't understand the OPPOSITE problem of abs(INT_MIN) returning a negative number either, and so you have simply traded one problem for another. If you are going to insist on using signed integers, you have no excuse for introducing a stupid security hole in your application. Enforcing the use of signed integers succeeds only in replacing more obvious bugs with even more subtle but equally as deadly bugs. I am dead set against this retarded mentality of using signed integers everywhere, especially when they simply replace one kind of bug with another kind of bug.

    ReplyDelete
  9. Hi Erik. Kind of late to reply, just wanted to share two thoughts after reading. The first is that your argument here is weak. The issue you've identified isn't one of signed vs. unsigned, it's one of overflow errors. Many implementations of the binary search algorithm were (and some still are) broken because they have low and high declared as (signed or unsigned, doesn't matter) integers and compute (low + high) / 2 for the midpoint instead of low + (high - low) / 2. Using unsigned vs. signed may save you from the abs() problem but doesn't save you from other overflow problems that are just as likely to happen if you're dealing with numbers near the limits. The solution of course is to defensively code by strictly filtering foreign input or asserting boundaries or using data types guaranteed not to overflow.

    My second thought is more on why unsigned is important but not so much for integers. Java 8 is -finally- adding support for unsigned integers, it's about time. That particular defect of forbidding unsigned was unnecessary, just like forbidding multiple inheritance that resolves conflicts by parent-order like Python does. Anyway, the reason having unsigned is important is because unsigned integers are a natural and efficient way to construct many algorithms and data types that work on bit strings with no concept of type. If you have to use signed integers as a basis the code quickly becomes slow, complicated, unwieldy, and stupidly difficult to maintain and reason about because suddenly particular bits in the bit strings are special under certain operations, and compilers will end up generating a lot of inefficient code.

    ReplyDelete
  10. Who the hell would do "for (unsigned int i = foo.Length()-1; i >= 0; --i)"?
    That's just idiotic, because you are specifically saying that it has to be positive, and then giving a condition that says that it needs to be negative for it to end. You might as well say "while(true)", because a non-negative number obviously cannot be negative.

    ReplyDelete
    Replies
    1. You can also avoid that entirely just by going

      for(unsigned int i = foo.Length(); i-- > 0;)

      which I think is more elegant anyway.

      Delete
  11. I think there are some interesting posts on your blog, but I disagree with you on this. The conventional wisdom on this matter is as presented in Google's C++ style guide, and I've come to agree that it's the correct approach.

    You cannot use unsigned everywhere or restrict signed to special cases. Too many standard APIs return long or int, many more than use unsigned or types such as size_t. Very often you want to be able to pass signed values around, you really need to represent negative values, or you want functions to accept signed simply so that users don't have be casting unsigned to signed all the time (since the values they're using very likely have to work together with something using signed at some point).
    Unsigned, on the other hand, can be restricted to special cases. E.g., cases where you need defined wrapping behaviour, cases where the extra positive range really is useful (e.g. uint8_t image components, file access APIs with only a 32 bit offset), cases where you're operating on a binary data format that defines unsigned fields, etc.
    If you do mix up signed and unsigned all through your code base, you need to engage in a lot of inconvenient and error-prone casting between them. I think it's best to choose to use one or the other in general, and due to the above reasons it has to be signed.

    You cannot rely on overflow of a signed value by 1 giving a negative value, it could very well just go up to a higher value than you thought possible because the compiler decided that behaviour was more efficient to implement (e.g., scalar int32_t arithmetic on a machine with 64b registers). This is because signed overflow is undefined. Unsigned overflow, on the other hand, is defined. E.g., ensuring standards-compliant wrapping for 32b values on a CPU with 64b registers can potentially incur a cost.
    There are cases where unsigned can give a performance benefit, but in general they're more rare. E.g., handling 64b integers on a 32b CPU, dividing by a power of two, absolute value, etc.

    In general, in terms of detecting bugs quickly for values/arguments expected to be positive, the behaviour of signed types is useful. We can easily assert(a_value >= 0). I suppose potentially you should also assert(a_value <= std::numeric_limits::max()). Even though the behaviour is undefined, and if you're doing something crazy enough you could wrap right back around to something in the expected range, you can almost always detect an argument that's gone outside the expected range. The number of times I've had such assertions pick up an issue quickly during testing, or pick up precondition violations due to bugs in the code calling my functions, is huge.
    For an unsigned integer, because it's guaranteed to wrap back to another valid positive value by the C++ standard, all we can really do is check against a maximum "sensible" value. That's definitely not self-documenting behaviour, could be unexpected, and defining such a value could be problematic.
    Considering that we have a choice between picking up many cases of unexpected value bugs with assertions (e.g., due to operations inadvertently wrapping or going negative), and probably not picking up any of them without selecting a maximum value to support, signed is a definite improvement.

    The other reason that I can think of to prefer signed values is that they're generally more intuitive to use. I'm sure we've all written something like that Google's example at some stage, ridiculous as it looks. Potentially not that obvious but the ability to go negative is natural in mathematics. In fact, I think the examples of potential bugs you gave in your post seem more unlikely, and the overflow issue you linked to is not something that's magically fixed by using unsigned types.

    I would say more, but you're probably sick of me by now and I've hit a character limit :). There are reasons that aren't completely without sense for using signed integers except in special cases, anyway.

    ReplyDelete
    Replies
    1. But you have missed the entire point of the blog. I was arguing against the OVERUSE of signed integers, not that we should stop using them. I'm arguing that both signed and unsigned have their places. I am arguing against using signed integers in places where you should very obviously use an unsigned integer. On a completely unrelated note, I do hope your aware that size_t is just an unsigned long (4 bytes on x86 and 8 bytes on x64). I am complaining about people not even considering using unsigned ints in places where they belong. I'm complaining about there being obvious uses for unsigned ints, as you mentioned in your post, where people use signed integers anyway because unsigned is "bad" for a bunch of stupid reasons.

      Even then the ability for you to check ranges using an unsigned integer is very useful, because it requires only a single check against length to verify that an index is correct. If a value of more than -1 got in there, you have bigger problems.

      I'm not saying we should only used unsigned, I'm saying we should use unsigned integers where its sensible instead of only using signed integers because that's retarded.

      Delete