Strings, Actually, Do Not Exist

Tags: Zig, C, Engineering
Owner: Justin Nearing
💡
In the long dark of Winter 2024, I went down a programming rabbit hole. I wasn’t at optimal mental health, and I don’t think this particularly helped, but I learned a lot.

This is the story of one programmer Quietly Going Insane With Tools & Automation:

  • 🌋 Quietly Going Insane With Tools & Automation
  • 🧵 Strings, Actually, Do Not Exist
  • Suffering: The First Two Weeks Of Zig

I’ve been lied to my entire career.

Strings, actually, do not exist.

I thought they existed.

Half the job is coercing json into strings ‘n back to jank the desired result.

I was wrong.

Arbitrary Slices of Bytes

Naturally, in Zig, “Strings” are just arrays of integers.

// This resolves to `comptime int`?
const foo = "bar";

There is no God.

There are no Strings.

We're all just arbitrary slices of null terminated bytes.

Letters Are Numbers, Actually

This document has been presented to you by the ‘always has been meme’

I said that in Zig, Strings are integers.

It would be more appropriate to say in C, Strings are integers.

C, which Zig is based on, is apparently the master main programming language.

Main as in main branch, as in 90% of all programming languages are apparently just “C with extra steps”.

🪐
Below C, unknown, The few who knew, have all moved, They now call Mars home.

This is a descent into basic lower-level programming.

It started as an innocuous script to automate the deletion of 4 directories in a given path.

It ended up with me 🌋Quietly Going Insane With Tools & Automation

Which descended into full-on degeneracy with the decision to use Zig to do the thing.

It’s been an exercise in Degeneracy Driven Development.

Yup, The ol’ Triple D.

But it turns out the joke’s on me, because Zig’s std.fs.Dir takes paths as strings, and a string literal is typed as *const [_:0] u8.

*const [_:0] u8 is straight garbage.

Complete and total nonsense.

I’ve seen simpler Regex.

I’ve never had to know what a single-item pointer to a null-terminated byte array is before, but it don’t matter.

In DDD, you can only choose the more degenerate path.

Which means diving full on into what turns out to be basic computer stuff.

Basic computer stuff that everyone knows already, and I should feel ashamed for never learning it.


OK so WTF is a String actually?

In Java/Node/Python/etc. the language provides abstractions to represent strings in a more programmer-friendly way.

Zig is unfriendly.

But Zig is your friend.

👹
To make sense of that, consider that C is not your friend. AND it’s unfriendly. C++ is the same as C, but with too much makeup.

From the Official Zig Documentation:

String literals are constant single-item Pointers to null-terminated byte arrays.

That is a dense sentence.

Reminds me of when I tried to read GoF Design Patterns when I just started out programming.

It’s the kind of sentence where I need to research each word in order to parse it.

So, let’s exhaustively research each word in that sentence to parse it!

Byte Array

Here is the top websearch result I got for byte array.

It’s a question on StackOverflow closed 13 years ago(!) for being “too vague”.

🌶️
This is the perfect example of why StackOverflow has cratered in relevance in the modern programming ecosystem. Instead of becoming the canonical websearch answer for what a byte array is, the question is closed for not being a pedantically-correct question. WTF is a Byte Array is not an ambiguous question. SO should have “websearch canonicality” as a weight for valid questions. It should be the canonical discussion for what a Byte Array is, now and historically. Instead, it’s just another zombie page polluting a dead internet.

Rant aside, here’s what we’re talking about:

const std = @import("std");

pub fn main() !void {
    // Writer for standard output (the print calls below need it)
    const stdout = std.io.getStdOut().writer();

    // Declare strings in two different ways:
    const string_literal = "hello"; // *const [5:0]u8
    const byte_array = [_]u8{ 'h', 'e', 'l', 'l', 'o' }; // [5]u8

    // Both print `hello`
    try stdout.print("{s}\n", .{string_literal}); // hello
    try stdout.print("{s}\n", .{byte_array}); // hello

    // Both of these fail to compile:
    //
    // const direct_assignment: *const [5:0]u8 = byte_array;
    // const comp_fail_cast: *const [5:0]u8 = @as(*const [5:0]u8, byte_array);
    //
    // error: expected type '*const [5:0]u8', found '[5]u8'
}
This was my first attempt at gettin’ stringy with it.

All of that to say, [5]u8 is a byte array.

The array [] has exactly and only 5 items in it.

Each item u8 is a number that maps to an alphanumeric character.

That mapping is through UTF-8 source encoding, the same mapping used by HTML and pretty much everything else.

It’s kind of like everyone agreed to something like:

h = 104

So that anytime you see an array with the number 104 you can assume it’s h

UTF-8 is slightly more complicated than this mapping.

h = 104 is valid Unicode: a simple character set mapping characters to integers.

UTF-8 is a mapping where:

h = 01101000

The alphanumeric character h is encoded to raw binary 01101000, which is just 104 written in base 2.

Notice 01101000 contains 8 digits.

Each digit takes 1 bit of memory.

8 bits equals 1 byte.

And because each byte of raw binary will never be negative, an unsigned integer is used instead of a signed integer.

Good vibes only, no negative numbers expected, the lowest you can go is 0.

This u8 defines “unsigned 8 bits of memory” in:

const byte_array = [5]u8{ 'h', 'e', 'l', 'l', 'o' }; 
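To check the letters-are-numbers claim for myself, here's a minimal sketch written as a Zig test (run it with zig test). The names word and accented are made up for illustration. The last part leans on Zig encoding \u{...} escapes in string literals as UTF-8, which is the “slightly more complicated” bit: outside plain ASCII, one character can take more than one byte.

```zig
const std = @import("std");
const expect = std.testing.expect;

// Hypothetical names, just for this sketch.
const word = "hello";
const accented = "\u{e9}"; // U+00E9, an e with an acute accent

test "letters are numbers" {
    // A character literal is just an integer: 'h' is 104, or 01101000 in binary.
    try expect('h' == 104);
    try expect('h' == 0b01101000);

    // Indexing into a string literal hands back a plain u8.
    try expect(@TypeOf(word[0]) == u8);
    try expect(word[0] == 104);

    // Outside ASCII, UTF-8 spends more than one byte on a character:
    // U+00E9 becomes the two bytes 0xC3 0xA9.
    try expect(accented.len == 2);
    try expect(accented[0] == 0xC3);
    try expect(accented[1] == 0xA9);
}
```

So “one u8 per letter” only holds for the ASCII range. For everything else, the array gets longer than the number of characters you see.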

Null Termination

However, a Zig string is not just a byte array. It’s a null-terminated byte array.

If h = 01101000

then null = 00000000

This null character, represented as \0, is implicitly added to the end of the byte array.

Functions that parse “strings” expect to find and treat \0 as the end of the string.

Our previous example had something like const arr = [_]u8 { 'h', 'h' };

We know that’s translated into [ 01101000 , 01101000 ]

And in memory just looks like 0110100001101000

But a null terminated array would be defined as const arr = [_:0] u8 { 'h', 'h' };

The [_:0]

  • Defines an array []
  • Infers the size of the array _ based on the number of entries (2, in this case)
  • Makes it null terminated :0

In this case, a null terminated byte array is stored as 011010000110100000000000
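Here's a small sketch of that extra byte, again as a Zig test. The names plain, terminated, and c_style are assumptions for illustration. It shows that len ignores the terminator, that the 0 really is sitting one slot past the end, and that std.mem.len finds the length the old-fashioned way: by walking until it hits the 0.

```zig
const std = @import("std");
const expect = std.testing.expect;

// The same two-letter example, with and without the sentinel.
const plain = [_]u8{ 'h', 'h' };
const terminated = [_:0]u8{ 'h', 'h' };

test "the null terminator sits just past the end" {
    try expect(@TypeOf(plain) == [2]u8);
    try expect(@TypeOf(terminated) == [2:0]u8);

    // len counts the letters only, not the terminator.
    try expect(plain.len == 2);
    try expect(terminated.len == 2);

    // On a sentinel-terminated array, index `len` is legal and holds the 0.
    try expect(terminated[terminated.len] == 0);

    // A C-style pointer carries no length at all, so std.mem.len
    // has to scan forward until it reaches the 0 byte.
    const c_style: [*:0]const u8 = &terminated;
    try expect(std.mem.len(c_style) == 2);
}
```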

Single-Item Pointer

But a string literal is not just a null terminated byte array.

It’s a single-item Pointer to a null-terminated byte array.

A pointer is a memory address.

At the end of the day, our program is just a long line of 1’s and 0’s.

A memory address tells the computer to remember an arbitrary point in that long line of numbers.

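So in our case of 011010000110100000000000 - a memory address is something that tells the computer where the “9th digit” sits in this line of binary. (Real addresses count whole bytes rather than single bits, so the “9th digit” is really “the start of the second byte”.)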

💡
Note that it only stores that location. The only thing this pointer knows is the location of the “9th digit”. When using that pointer- that memory address of the 9th digit- it’s only used as the starting location.

You need to tell the compiler how many bits to read from that point.

If we pass in a u8, it knows to take the first 8 bits it finds starting from the 9th digit. In that case, we would get 01101000, which as we know is the UTF-8 encoding for the character h (the second h in our array).
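A minimal sketch of that idea, assuming the same two-letter array (the names arr and second are made up). The pointer is taken at the second byte, and the u8 in its type is what tells the compiler to read 8 bits from there.

```zig
const std = @import("std");
const expect = std.testing.expect;

// 'h', 'h', then the implicit 0 terminator.
const arr = [_:0]u8{ 'h', 'h' };

test "a pointer is a starting location plus a type" {
    // Address of the second byte of the array.
    const second: *const u8 = &arr[1];
    try expect(@TypeOf(second) == *const u8);

    // Dereferencing reads exactly one u8 (8 bits) starting at that address.
    try expect(second.* == 0b01101000);
    try expect(second.* == 'h');

    // Point at the third byte instead and the same read gives the terminator.
    const third: *const u8 = &arr[2];
    try expect(third.* == 0);
}
```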

A single-item Pointer is a pointer to one, and only one, item: it holds an address, and its type promises there is exactly one thing of that type sitting there.

From the documentation (as of March 20, 2024):

Zig has two kinds of pointers: single-item and many-item.
  • *T - single-item pointer to exactly one item.
    • Supports deref syntax: ptr.*
  • [*]T - many-item pointer to unknown number of items.
    • Supports index syntax: ptr[i]

This is only half the pointers listed in the docs. Cool.

The act of parsing memory from a pointer is what’s referred to as dereferencing.

Consider the following:

const expect = @import("std").testing.expect;

test "address of syntax" {
    // Get the address of a variable:
    const x: i32 = 1234;
    const x_ptr = &x;
    
    // Dereference a pointer:
    try expect(x_ptr.* == 1234);
}

x: i32 = 1234 says “use 32 bits representing a signed integer, initialize it with value 1234, call it x”

const x_ptr = &x says “give me the memory address of x and call it x_ptr”

try expect(x_ptr.* == 1234) says:

  • Starting at the memory address (x_ptr), give me the next 32 bits you find (.*).
  • Is it true (==) that those 32 bits you find encodes to 1234 ?
  • Test the result (expect), and if the test fails, hand the resulting error back up to the caller (try)

&x is a reference to x

x_ptr.* dereferences x_ptr to give us the value of x.
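To see how the pointer kinds from the docs differ in practice, here's a rough sketch using a five-letter array. The names letters, one, many, and slice are assumptions for illustration. As far as I can tell, the other half of the docs list is the pointer-to-array *[N]T and the slice []T, so those show up here too.

```zig
const std = @import("std");
const expect = std.testing.expect;

const letters = [_]u8{ 'h', 'e', 'l', 'l', 'o' };

test "single-item vs many-item pointers" {
    // Single-item pointer: the "one item" it points at is the whole array.
    const one = &letters;
    try expect(@TypeOf(one) == *const [5]u8);
    try expect(one.*[0] == 'h'); // dereference, then index
    try expect(one[1] == 'e'); // pointers to arrays can also be indexed directly
    try expect(one.len == 5); // the length lives in the type

    // Many-item pointer: just a starting address, no length attached.
    const many: [*]const u8 = one;
    try expect(many[4] == 'o');

    // Slice: a pointer plus a length that travels with it.
    const slice: []const u8 = one;
    try expect(slice.len == 5);
    try expect(slice[0] == 'h');
}
```

The many-item pointer is the one to be careful with: nothing stops you from indexing past the end, because it has no idea where the end is.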

So all of the above, in code:

const std = @import("std");

pub fn main() !void {   
    const stdout = std.io.getStdOut().writer();

    // Obvious way to declare a string
    // Non-obvious type this resolves to: `*const [5:0]u8`
    const string_literal = "hello"; 

    // Let's construct this a different way
    // [] defines array; 
    // _ infers the size of array from the number of items added
    // u8 stores each character using 8 bits of memory 
    // This is a byte_array that ends up being: `[5]u8`
    const byte_array = [_]u8{
        'h',
        'e',
        'l',
        'l',
        'o',
    };

    // Create a single-item pointer to the byte_array
    // This results in *const [5]u8
    const barr_ptr = &byte_array; 

    // Note that "barr_ptr"
    //   *const [5]u8
    // is not the same as "string_literal"
    //   *const [5:0]u8
    // This is because strings are *null terminated* byte arrays.
    // [_:0] defines array, infers size, adds 00000000 (null in 8 bits) at the end
    //
    // The following resolves to `[5:0]u8`
    //
    const null_term_byte_array = [_:0]u8{
        'h',
        'e',
        'l',
        'l',
        'o',
    };

    // Pointer to the null terminated byte array.
    // This gives us *const [5:0]u8
    // And this IS the same as `string_literal`
    const null_barr_ptr = &null_term_byte_array;

    //
    // They all have the same value
    //
    try stdout.print("'string_literal' = {s}\n", .{string_literal});
    try stdout.print("'byte_array' = {s}\n", .{byte_array});
    try stdout.print("'barr_ptr.*' = {s}\n", .{barr_ptr.*});
    try stdout.print("'null_term_byte_array' = {s}\n", .{null_term_byte_array});
    try stdout.print("'null_barr_ptr.*' = {s}\n", .{null_barr_ptr.*});
    // 'string_literal' = hello
    // 'byte_array' = hello
    // 'barr_ptr.*' = hello
    // 'null_term_byte_array' = hello
    // 'null_barr_ptr.*' = hello

    //
    // But they are not the same thing
    //
    try stdout.print("'string_literal' = {}\n", .{@TypeOf(string_literal)});
    try stdout.print("'byte_array' = {}\n", .{@TypeOf(byte_array)});
    try stdout.print("'barr_ptr' = {}\n", .{@TypeOf(barr_ptr)});
    try stdout.print("'barr_ptr.*' = {}\n", .{@TypeOf(barr_ptr.*)});
    try stdout.print("'null_term_byte_array' = {}\n", .{@TypeOf(null_term_byte_array)});
    try stdout.print("'null_barr_ptr.*' = {}\n", .{@TypeOf(null_barr_ptr.*)});
    // 'string_literal' = *const [5:0]u8
    // 'byte_array' = [5]u8
    // 'barr_ptr' = *const [5]u8
    // 'barr_ptr.*' = [5]u8
    // 'null_term_byte_array' = [5:0]u8
    // 'null_barr_ptr.*' = [5:0]u8

    //
    // This is false:
    // (std.meta.eql does not follow pointers, so this compares two addresses)
    //
    if (std.meta.eql(&byte_array, string_literal)) {
        try stdout.print("true\n", .{});
    } else {
        try stdout.print("false\n", .{});
    }

    //
    // This is true:
    // (still an address comparison; it only says both pointers hold the same address)
    //
    if (std.meta.eql(&null_term_byte_array, string_literal)) {
        try stdout.print("true\n", .{});
    } else {
        try stdout.print("false\n", .{});
    }
}
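One wrinkle with those last two checks: std.meta.eql is documented as not following pointers, so it is comparing where the letters live, not what they are. If the goal is “do these two things spell the same word”, std.mem.eql looks like the better fit. A minimal sketch, where sameText is a hypothetical helper I made up and not something from the standard library:

```zig
const std = @import("std");
const expect = std.testing.expect;

const string_literal = "hello";
const byte_array = [_]u8{ 'h', 'e', 'l', 'l', 'o' };

// Hypothetical helper: compares two strings byte by byte.
// Both *const [5]u8 and *const [5:0]u8 coerce to []const u8,
// so the null terminator stops mattering here.
fn sameText(a: []const u8, b: []const u8) bool {
    return std.mem.eql(u8, a, b);
}

test "compare the letters, not the addresses" {
    // Different types, different memory, same contents.
    try expect(sameText(&byte_array, string_literal));
    try expect(sameText(string_literal, "hello"));

    // Same length, different letters.
    try expect(!sameText(string_literal, "help!"));

    // Different lengths are never equal.
    try expect(!sameText(string_literal, "hell"));
}
```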

Who Cares?

I have a few hours each day to grind on whatever random projects I have on the go, which means this dive into wtf a string actually is has been going on for 5 days.

I’m not sure exactly how useful it is to actually know what a *const [_:0] u8 is, other than as some interesting trivia.

The good news is, the sentence “A string literal is a single-item pointer to a null terminated byte array” is now something I understand.

♀️
My wife, who is smarter than I am, laughed at me when I explained this to her.

She already knew, and legit caught me mansplaining strings to her.

Rekt.

But joke’s on her, imo, because she paid like what, $600 per class to learn this to get her compsci bachelor’s?

I got here with free tier ChatGPT and a bad attitude.

Rekt indeed.

Here’s the thing. I only did any of this so that I could use Zig to finish the story in 🌋Quietly Going Insane With Tools & Automation.

That was published March 13, 2024.

I’m writing this March 28, 2024.

Point is, we’re going on a solid month to clear 4 directories from an Unreal project directory, in a task I chose to do purely for the memes.

Truly, I embody the statement:

Suffering: The First Two Weeks Of Zig