Intro

This is a story of a premature optimization that I started without proper analysis. It taught me that on different platforms we can achieve different results when working with the null character.

Some time ago I had to parse bank account numbers, represented as strings, according to their type (IBAN, BankGiro, PlusGiro, etc.) and a respective set of predefined rules. I will not go deep into the technical requirements, but the first step, before the actual parser came into play, was to remove all whitespace from the account number. Knowing that we could have a huge array of bank account numbers, I decided not to use a simple accountNumber.Replace(" ", string.Empty) and instead to work with a StringBuilder that would replace each whitespace character with a null character. That was a big mistake 😅

Null character

What is it? It is a special control character with the value zero, one that seemingly represents absolutely nothing:

Escape sequence    Character name    Unicode encoding
\0                 Null              0x0000

My initial thought was: excellent, this is the best replacement for whitespace, and we don’t have to use string.Empty.

You can find the documentation on the Microsoft site: https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/strings/

Already in the first paragraph it says (which led me to my misunderstanding):

“There is no null-terminating character at the end of a C# string; therefore a C# string can contain any number of embedded null characters (’\0’)”

So now we have

accountNumber.Replace(' ', '\0')

What could be wrong, right? Hehe…

Testing

Let's test a small example:

using System;
using System.Text;

public class Program
{
    public static void Main()
    {
        var originalData = "1234 5678 1234 5678";
        Console.WriteLine("Original data: " + originalData);
        Console.WriteLine("Original data length: " + originalData.Length);

        // Replace every space with the null character instead of removing it.
        var modifiedData = new StringBuilder(originalData, originalData.Length).Replace(' ', '\0');
        Console.WriteLine("Modified data: " + modifiedData);
        Console.WriteLine("Modified data length: " + modifiedData.Length);
    }
}

If you run it at https://dotnetfiddle.net/ the following result will be shown:

Original data: 1234 5678 1234 5678
Original data length: 19
Modified data: 1234567812345678
Modified data length: 19

The data now looks like it has no whitespace, all good, and the length... wait, what?

I really did not believe it; it must be some kind of bug at DotNetFiddle. I ran the same code in a simple console project on my MacBook using Rider and got exactly the same result. What the heck?

Looking at it more closely in the debugger, we can see:

[Image: Debugging session]

Interesting, what are all those stack-lines? So it is true that the length is still 19.
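The same thing can be seen without a debugger. Below is a minimal sketch that spells out every null character as a visible "\0"; the class NullInspector and the helper MakeNullsVisible are hypothetical names made up for this illustration, not part of .NET.

```csharp
using System;
using System.Text;

public static class NullInspector
{
    // Hypothetical helper: returns a copy of the input in which every
    // null character is written out as the two visible characters "\0".
    public static string MakeNullsVisible(string input)
    {
        var visible = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (c == '\0')
                visible.Append("\\0");
            else
                visible.Append(c);
        }
        return visible.ToString();
    }

    public static void Main()
    {
        // Same replacement as in the test program above.
        string modified = new StringBuilder("1234 5678 1234 5678")
            .Replace(' ', '\0')
            .ToString();

        Console.WriteLine(MakeNullsVisible(modified)); // 1234\05678\01234\05678
        Console.WriteLine(modified.Length);            // 19
    }
}
```

The three null characters are still in the string; the console simply has nothing to draw for them, which is why the output looked clean while the length stayed at 19.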

Why is it happening

Well, I will not try to explain it all in my own words; instead I will quote another part of the Microsoft documentation (why is it so hard to find the right docs?): https://docs.microsoft.com/en-us/dotnet/api/system.string?redirectedfrom=MSDN&view=net-5.0#EmbeddedNulls

“In .NET, a String object can include embedded null characters, which count as a part of the string’s length. However, in some languages such as C and C++, a null character indicates the end of a string; it is not considered a part of the string and is not counted as part of the string’s length. This means that the following common assumptions that C and C++ programmers or libraries written in C or C++ might make about strings are not necessarily valid when applied to String objects:

  • The value returned by the strlen or wcslen functions does not necessarily equal String.Length.
  • The string created by the strcpy_s or wcscpy_s functions is not necessarily identical to the string created by the String.Copy method.

You should ensure that native C and C++ code that instantiates String objects, and code that is passed String objects through platform invoke, don’t assume that an embedded null character marks the end of the string.

Embedded null characters in a string are also treated differently when a string is sorted (or compared) and when a string is searched. Null characters are ignored when performing culture-sensitive comparisons between two strings, including comparisons using the invariant culture. They are considered only for ordinal or case-insensitive ordinal comparisons. On the other hand, embedded null characters are always considered when searching a string with methods such as Contains, StartsWith, and IndexOf.”
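To make the quote a bit more concrete, here is a small sketch. CStyleLength is a hypothetical helper that only imitates how a C function like strlen would see the string (it is not a real interop call), and EmbeddedNullDemo is just a name for this example. The ordinal results are annotated; the culture-sensitive comparison is printed without an expected value because I would rather not promise what a given runtime and platform will return.

```csharp
using System;

public static class EmbeddedNullDemo
{
    // Hypothetical helper that mimics C's strlen: it counts characters
    // up to, but not including, the first null character.
    public static int CStyleLength(string s)
    {
        int firstNull = s.IndexOf('\0');
        return firstNull < 0 ? s.Length : firstNull;
    }

    public static void Main()
    {
        string withNulls = "1234 5678 1234 5678".Replace(' ', '\0');
        string clean = "1234567812345678";

        Console.WriteLine(withNulls.Length);        // 19
        Console.WriteLine(CStyleLength(withNulls)); // 4

        // Ordinal comparison looks at every char, including '\0'.
        Console.WriteLine(string.Equals(withNulls, clean, StringComparison.Ordinal)); // False

        // Searching for a char always considers embedded nulls.
        Console.WriteLine(withNulls.IndexOf('\0')); // 4

        // Culture-sensitive comparison: result intentionally not annotated,
        // check it on your own runtime and platform.
        Console.WriteLine(string.Compare(withNulls, clean, StringComparison.InvariantCulture));
    }
}
```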

Summary

In this post I have shown how attempting a premature optimization, without profiling the worst-case scenario first, led me to an interesting gotcha moment about how .NET works with strings and null characters.

The most interesting part is that, in the end, the array of bank account numbers was not big enough to require extra memory optimizations, and a simple accountNumber.Replace(" ", string.Empty) still works!
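For completeness, here is a sketch of what actually removing the characters looks like. RemoveSpaces is the simple approach mentioned above; RemoveAllWhitespace is a hypothetical variant, not the code from my project, that uses a StringBuilder and skips whitespace characters instead of substituting them. Both names and the class AccountNumberCleaner are made up for this example.

```csharp
using System;
using System.Text;

public static class AccountNumberCleaner
{
    // The simple approach: drop every space character.
    public static string RemoveSpaces(string accountNumber)
        => accountNumber.Replace(" ", string.Empty);

    // Hypothetical alternative: copy only non-whitespace characters,
    // which also handles tabs and other whitespace.
    public static string RemoveAllWhitespace(string accountNumber)
    {
        var result = new StringBuilder(accountNumber.Length);
        foreach (char c in accountNumber)
        {
            if (!char.IsWhiteSpace(c))
                result.Append(c);
        }
        return result.ToString();
    }

    public static void Main()
    {
        string cleaned = RemoveSpaces("1234 5678 1234 5678");
        Console.WriteLine(cleaned);        // 1234567812345678
        Console.WriteLine(cleaned.Length); // 16

        Console.WriteLine(RemoveAllWhitespace("1234\t5678 1234 5678").Length); // 16
    }
}
```

Whether the second version is worth it over the first is exactly the kind of question that profiling should answer.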

And to avoid confusion and misunderstanding, it is better either to re-read all the relevant documentation (as you can see, the quote came from a different documentation page) or not to use things that you have not tested.

And again, it is better to profile your application first and then optimize it. Stay safe!