Gotcha with String Null Character
Intro
This is the story of a premature optimization that I started without proper analysis. It taught me that working with the null character can give different results on different platforms.
Some time ago I had to parse a bank account number, represented as a string, according to its type (IBAN, BankGiro, PlusGiro, etc.) and a respective set of predefined rules. I will not go deep into the technical requirements, but the first step, before the actual parser came into play, was to remove all whitespace in the account number. Knowing that we could have a huge array of bank account numbers, I decided not to use a simple accountNumber.Replace(" ", string.Empty) and instead to work with a StringBuilder that would replace each whitespace character with a null character. That was a big mistake 😅
Null character
What is it? It is a special character that represents absolutely nothing:
Escape sequence | Character name | Unicode encoding |
---|---|---|
\0 | Null | 0x0000 |
My initial thought was: excellent, this is the best replacement for white spaces, and we don’t have to use string.Empty.
You can find documentation on the Microsoft site https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/strings/
Already in the first paragraph it says (which led me to my misunderstanding):
“There is no null-terminating character at the end of a C# string; therefore a C# string can contain any number of embedded null characters (’\0’)”
So now we have a StringBuilder that replaces every white space in the account number with '\0'.
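The original listing is not available here, so the following is only a minimal sketch of the approach described above. The class name, the method name `ReplaceSpacesWithNull`, and the sample input are hypothetical and chosen for illustration.

```csharp
using System;
using System.Text;

public static class NullReplacementSketch
{
    // Hypothetical helper: swaps every space for '\0' instead of removing it.
    public static string ReplaceSpacesWithNull(string accountNumber)
    {
        var builder = new StringBuilder(accountNumber);
        builder.Replace(' ', '\0'); // StringBuilder.Replace(char, char)
        return builder.ToString();
    }

    public static void Main()
    {
        // Made-up input, not a real account number.
        string result = ReplaceSpacesWithNull("1234 5678 9012 3456");
        Console.WriteLine(result);
    }
}
```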
What could be wrong, right? Hehe…
Testing
Let's test it with a small example.
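A test along these lines could look like the sketch below. It is not the original listing: the input string is a made-up value and the helper name is an assumption. It prints the data and the length before and after the replacement so the two can be compared.

```csharp
using System;
using System.Text;

public static class NullLengthCheck
{
    // Hypothetical helper: replaces each space with the null character.
    public static string ReplaceSpacesWithNull(string value) =>
        new StringBuilder(value).Replace(' ', '\0').ToString();

    public static void Main()
    {
        // Made-up input, not a real account number.
        string input = "1234 5678 9012 3456";
        string replaced = ReplaceSpacesWithNull(input);

        Console.WriteLine($"Data: {replaced}");
        Console.WriteLine($"Length before: {input.Length}");
        Console.WriteLine($"Length after: {replaced.Length}");
    }
}
```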
If you run it at https://dotnetfiddle.net/, the printed data looks like it has no whitespace, all good, and the length... wait, what? It is exactly the same as before the replacement.
I really did not believe it. It must be some kind of bug at DotNetFiddle, I thought, so I ran the same code in a simple console project on my MacBook using Rider and got exactly the same result. What the heck?
Looking at it more closely via debugger we can see:
Interesting, what are all those stack-lines? So it is true that the length is still 19.
Why is it happening
Well, I will not try to explain it all in my own words. Instead I will quote another part of the Microsoft documentation (why is it so hard to find the right docs?): https://docs.microsoft.com/en-us/dotnet/api/system.string?redirectedfrom=MSDN&view=net-5.0#EmbeddedNulls
“In .NET, a String object can include embedded null characters, which count as a part of the string’s length. However, in some languages such as C and C++, a null character indicates the end of a string; it is not considered a part of the string and is not counted as part of the string’s length. This means that the following common assumptions that C and C++ programmers or libraries written in C or C++ might make about strings are not necessarily valid when applied to String objects:
- The value returned by the strlen or wcslen functions does not necessarily equal String.Length.
- The string created by the strcpy_s or wcscpy_s functions is not necessarily identical to the string created by the String.Copy method.
You should ensure that native C and C++ code that instantiates String objects, and code that is passed String objects through platform invoke, don’t assume that an embedded null character marks the end of the string.
Embedded null characters in a string are also treated differently when a string is sorted (or compared) and when a string is searched. Null characters are ignored when performing culture-sensitive comparisons between two strings, including comparisons using the invariant culture. They are considered only for ordinal or case-insensitive ordinal comparisons. On the other hand, embedded null characters are always considered when searching a string with methods such as Contains, StartsWith, and IndexOf.”
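To make the quote a bit more concrete, here is a small sketch with made-up strings. The ordinal results are annotated because they follow directly from the rules above. The culture-sensitive comparison is only printed, not annotated, since I have not verified its result on every runtime and globalization setup.

```csharp
using System;

public static class EmbeddedNullSketch
{
    public static void Main()
    {
        string withNull = "AB\0CD"; // five characters, one of them '\0'
        string without = "ABCD";

        Console.WriteLine(withNull.Length);         // 5: the null character counts
        Console.WriteLine(string.Equals(withNull, without, StringComparison.Ordinal)); // False
        Console.WriteLine(withNull.IndexOf('\0'));  // 2: the char search finds it
        Console.WriteLine(withNull.Contains("\0")); // True: Contains(string) is ordinal

        // Culture-sensitive comparison: the quote says null characters are ignored here.
        // The value is printed rather than assumed.
        Console.WriteLine(string.Compare(withNull, without, StringComparison.InvariantCulture));
    }
}
```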
Summary
In this post I have shown how a premature optimization, made without first profiling the straightforward solution, gave me an interesting gotcha moment about how .NET handles strings and null characters.
The most interesting part is that, in the end, the array of bank account numbers was not big enough to need extra memory optimizations, and a simple accountNumber.Replace(" ", string.Empty) still works!
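For comparison, here is a minimal sketch of two ways to actually remove the characters instead of replacing them. The class and method names and the sample input are hypothetical. The StringBuilder variant is just one possible alternative, and I have not measured whether it is faster.

```csharp
using System;
using System.Text;

public static class WhitespaceRemovalSketch
{
    // The simple approach from the post: remove spaces outright.
    public static string RemoveSpaces(string accountNumber) =>
        accountNumber.Replace(" ", string.Empty);

    // StringBuilder variant: append only the characters we want to keep.
    public static string RemoveWhitespace(string accountNumber)
    {
        var builder = new StringBuilder(accountNumber.Length);
        foreach (char c in accountNumber)
        {
            if (!char.IsWhiteSpace(c))
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }

    public static void Main()
    {
        // Made-up input, not a real account number.
        string input = "1234 5678 9012 3456";
        Console.WriteLine(RemoveSpaces(input).Length);     // 16
        Console.WriteLine(RemoveWhitespace(input).Length); // 16
    }
}
```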
To avoid this kind of confusion and misunderstanding, either read all the relevant documentation (as you can see, the decisive quote came from a different page) or do not rely on something you have not tested.
And again, profile your application first and then optimize it. Stay safe!