This is the story of one programmer Quietly Going Insane With Tools & Automation.
I’ve been lied to my entire career.
Strings, actually, do not exist.
I thought they existed.
Half the job is coercing JSON into strings ‘n back to jank the desired result.
I was wrong.
Arbitrary Slices of Bytes
Naturally, in Zig, “Strings” are just arrays of integers.
// This resolves to `*const [3:0]u8`
const foo = "bar";
There is no God.
There are no Strings.
We're all just arbitrary slices of null terminated bytes.
Letters Are Numbers, Actually
I said that in Zig, Strings are integers.
It would be more appropriate to say in C, Strings are integers.
C, which Zig is modeled on and built to interoperate with, is apparently the master main programming language.
Main as in main branch, as in 90% of all programming languages are apparently just “C with extra steps”.
This is a descent into basic lower-level programming.
It started as an innocuous script to automate the deletion of 4 directories in a given path.
It ended up with me Quietly Going Insane With Tools & Automation, which descended into full-on degeneracy with the decision to use Zig to do the thing.
It’s been an exercise in Degeneracy Driven Development.
Yup, The ol’ Triple D.
But it turns out the joke’s on me, because Zig’s std.fs.Dir represents paths as strings, and represents strings as *const [_:0]u8.
*const [_:0]u8 is straight garbage.
Complete and total nonsense.
I’ve seen simpler Regex.
I’ve never had to know what a single-item pointer to a null-terminated byte array is before, but it don’t matter.
In DDD, you can only choose the more degenerate path.
Which means diving full on into what turns out to be basic computer stuff.
Basic computer stuff that everyone knows already, and I should feel ashamed for never learning it.
OK so WTF is a String actually?
In Java/Node/Python/etc. the language provides abstractions to represent strings in a more programmer-friendly way.
Zig is unfriendly.
But Zig is your friend.
C is not your friend.
AND it’s unfriendly.
C++ is the same as C, but with too much makeup.
From the Official Zig Documentation:
String literals are constant single-item Pointers to null-terminated byte arrays.
That is a dense sentence.
Reminds me of when I tried to read GoF Design Patterns when I just started out programming.
It’s the kind of sentence where I need to research each word in order to parse it.
So, let’s exhaustively research each word in that sentence to parse it!
Byte Array
Here is the top websearch result I got for byte array.
It’s a question on StackOverflow closed 13 years ago(!) for being “too vague”.
Instead of explaining what a byte array is, the question is closed for not being a pedantically-correct question.
WTF is a Byte Array is not an ambiguous question.
SO should have “websearch canonicality” as a weight for valid questions. It should be the canonical discussion for what a Byte Array is, now and historically.
Instead, it’s just another zombie page polluting a dead internet.
Rant aside, here’s what we’re talking about:
const std = @import("std");

pub fn main() !void {
    const stdout = std.io.getStdOut().writer();

    // Declare strings in two different ways:
    const string_literal = "hello"; // *const [5:0]u8
    const byte_array = [_]u8{ 'h', 'e', 'l', 'l', 'o' }; // [5]u8

    // Both print `hello`
    try stdout.print("{s}\n", .{string_literal}); // hello
    try stdout.print("{s}\n", .{byte_array}); // hello

    // Both of these fail to compile:
    //
    // const direct_assignment: *const [5:0]u8 = byte_array;
    // const comp_fail_cast: *const [5:0]u8 = @as(*const [5:0]u8, byte_array);
    //
    // error: expected type '*const [5:0]u8', found '[5]u8'
}
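Those two assignments fail, but both values will happily turn into a slice. Here is a minimal sketch of that, assuming a Zig version from around the time of writing (0.11 or 0.12) and run with zig test. The byteCount helper is made up for illustration and is not part of std.

```zig
const std = @import("std");
const expect = std.testing.expect;

// Hypothetical helper: takes any run of bytes, wherever it came from.
fn byteCount(text: []const u8) usize {
    return text.len;
}

test "arrays and string literals both coerce to slices" {
    const string_literal = "hello"; // *const [5:0]u8
    const byte_array = [_]u8{ 'h', 'e', 'l', 'l', 'o' }; // [5]u8

    // A pointer to an array coerces to a slice (an address plus a length).
    const from_literal: []const u8 = string_literal;
    const from_array: []const u8 = &byte_array;
    try expect(from_literal.len == 5);
    try expect(from_array[0] == 'h');

    // The same function accepts both, because both become []const u8.
    try expect(byteCount(string_literal) == 5);
    try expect(byteCount(&byte_array) == 5);
}
```

Note the & in front of byte_array: the array itself is a value, and it is the pointer to it that coerces to a slice.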
All of that to say, [5]u8 is a byte array.
The array [] has exactly and only 5 items in it.
Each item u8 is a number that maps to an alphanumeric character.
That mapping is through UTF-8 source encoding, the same mapping used by HTML and pretty much everything else.
It’s kind of like everyone agreed to something like:
h = 104
So that anytime you see an array with the number 104 you can assume it’s h.
UTF-8 is slightly more complicated than this mapping.
h = 104 is valid Unicode: a simple character set mapping characters to integers.
UTF-8 is the encoding that turns those integers into bytes, where:
h = 01101000
The alphanumeric character h is encoded to raw binary 01101000, which is just 104 written in base 2.
Characters outside the basic ASCII range take two to four bytes instead of one.
Notice 01101000 contains 8 digits.
Each digit takes 1 bit of memory.
8 bits equals 1 byte.
And because each byte of raw binary will never be negative, an unsigned integer is used instead of a signed integer.
Good vibes only, no negative numbers expected, the lowest you can go is 0.
This u8 defines “unsigned 8 bits of memory” in:
const byte_array = [5]u8{ 'h', 'e', 'l', 'l', 'o' };
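To make the letters-are-numbers thing concrete, here is a small sketch under the same assumptions (Zig 0.11 or 0.12, run with zig test). The bitsOf helper is hypothetical, written only to spell a byte out as eight '0'/'1' characters.

```zig
const std = @import("std");
const expect = std.testing.expect;

// Hypothetical helper: spell out the 8 bits of one byte as '0'/'1' characters.
fn bitsOf(byte: u8) [8]u8 {
    const zero: u8 = '0';
    const one: u8 = '1';
    var out: [8]u8 = undefined;
    var i: usize = 0;
    while (i < 8) : (i += 1) {
        // Walk from the highest bit (shift 7) down to the lowest (shift 0).
        const shift: u3 = @intCast(7 - i);
        out[i] = if (((byte >> shift) & 1) == 1) one else zero;
    }
    return out;
}

test "a letter is a number is eight bits" {
    const h: u8 = 'h';
    try expect(h == 104);
    try expect(h == 0b01101000);

    const bits = bitsOf(h);
    try expect(std.mem.eql(u8, &bits, "01101000"));

    // u8 is unsigned: nothing below 0, nothing above 255.
    try expect(std.math.minInt(u8) == 0);
    try expect(std.math.maxInt(u8) == 255);

    // Outside ASCII, UTF-8 spends more than one byte per character.
    try expect("\u{e9}".len == 2); // U+00E9, e with an acute accent
}
```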
Null Termination
However, a Zig string is not just a byte array. It’s a null-terminated byte array.
If h = 01101000 then null = 00000000
This null character, represented as \0, is implicitly added to the end of the byte array.
Functions that parse “strings” expect to find and treat \0 as the end of the string.
Our previous example had something like const arr = [_]u8 { 'h', 'h' };
We know that’s translated into [ 01101000 , 01101000 ]
And in memory just looks like 0110100001101000
But a null terminated array would be defined as const arr = [_:0]u8{ 'h', 'h' };
The [_:0]
- Defines an array []
- Infers the size of the array _ based on the number of entries (2, in this case)
- Makes it null terminated :0
In this case, a null terminated byte array is stored as 011010000110100000000000
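A quick sketch of where that extra 00000000 lives, again assuming Zig 0.11 or 0.12 and zig test. The names plain and terminated are mine, chosen to mirror the two arr examples above.

```zig
const std = @import("std");
const expect = std.testing.expect;

const plain = [_]u8{ 'h', 'h' };
const terminated = [_:0]u8{ 'h', 'h' };

test "the null terminator sits one past the end" {
    try expect(@TypeOf(plain) == [2]u8);
    try expect(@TypeOf(terminated) == [2:0]u8);

    // The terminator is not counted in the length...
    try expect(plain.len == 2);
    try expect(terminated.len == 2);

    // ...but it is there, and indexing at `len` is allowed for sentinel arrays.
    try expect(terminated[2] == 0);

    // String literals behave the same way.
    const literal = "hh";
    try expect(literal.len == 2);
    try expect(literal[2] == 0);
}
```

Indexing plain[2] would be rejected as out of bounds; the sentinel slot only exists because the type says :0.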
Single-Item Pointer
But a string literal is not just a null terminated byte array.
It’s a single-item Pointer to a null-terminated byte array.
A pointer is a memory address.
At the end of the day, our program is just a long line of 1’s and 0’s.
A memory address tells the computer to remember an arbitrary point in that long line of numbers.
So in our case of 011010000110100000000000, a memory address is something that tells the computer to remember the “9th digit” in this line of binary.
(Real addresses count in bytes, not bits, but the idea is the same.)
You need to tell the compiler how many bits to read from that point.
If we pass in a u8, it knows to take the first 8 bits it finds starting from the 9th digit.
In that case, we would get 01101000, which as we know is the UTF-8 encoding for the character h.
A single-item pointer is a pointer to exactly one item.
From the documentation (as of March 20, 2024):
Zig has two kinds of pointers: single-item and many-item.
- *T - single-item pointer to exactly one item.
  - Supports deref syntax: ptr.*
- [*]T - many-item pointer to unknown number of items.
  - Supports index syntax: ptr[i]
This is only half the pointers listed in the docs. Cool.
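Here is a rough sketch of those two pointer kinds side by side, assuming Zig 0.11 or 0.12 and zig test. The bytes array and the variable names are made up for the example.

```zig
const std = @import("std");
const expect = std.testing.expect;

const bytes = [_:0]u8{ 'h', 'i' };

test "single-item and many-item pointers" {
    // Single-item pointer to one byte: the second element.
    const one: *const u8 = &bytes[1];
    try expect(one.* == 'i');

    // Single-item pointer to the whole array (the "one item" is the array).
    const whole: *const [2:0]u8 = &bytes;
    try expect(whole.len == 2);
    try expect(whole.*[0] == 'h');

    // Many-item pointer: an address with no length attached.
    const many: [*]const u8 = &bytes;
    try expect(many[0] == 'h');
    try expect(many[1] == 'i');
}
```

The many-item pointer has no idea where the data ends, which is exactly the problem a null terminator (or a slice's length) exists to solve.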
The act of parsing memory from a pointer is what’s referred to as dereferencing.
Consider the following:
const expect = @import("std").testing.expect;

test "address of syntax" {
    // Get the address of a variable:
    const x: i32 = 1234;
    const x_ptr = &x;

    // Dereference a pointer:
    try expect(x_ptr.* == 1234);
}
x: i32 = 1234 says “use 32 bits representing an integer, initialize it with value 1234, call it x”
const x_ptr = &x says “give me the memory address of x and call it x_ptr”
try expect(x_ptr.* == 1234) says:
- Starting at the memory address (x_ptr), give me the next 32 bits you find (.*).
- Is it true (==) that those 32 bits encode to 1234?
- Test the result (expect), and tell the compiler the test can fail, so pass the error along if it does (try)
&x is a reference to x
x_ptr.* dereferences x_ptr to give us the value of x.
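Dereferencing works for writing too, not just reading. A minimal sketch, assuming Zig 0.11 or 0.12 and zig test; the increment helper is hypothetical and only exists to show a function writing through a pointer it was handed.

```zig
const std = @import("std");
const expect = std.testing.expect;

// Hypothetical helper: takes a single-item pointer and writes through it.
fn increment(ptr: *i32) void {
    ptr.* += 1;
}

test "reading and writing through a pointer" {
    const x: i32 = 1234;
    const x_ptr = &x;
    try expect(@TypeOf(x_ptr) == *const i32); // x is const, so the pointer is too
    try expect(x_ptr.* == 1234);

    var y: i32 = 5678;
    increment(&y); // &y is a *i32 because y is var
    try expect(y == 5679);
}
```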
So all of the above, in code:
const std = @import("std");

pub fn main() !void {
    const stdout = std.io.getStdOut().writer();

    // Obvious way to declare a string
    // Non-obvious type this resolves to: `*const [5:0]u8`
    const string_literal = "hello";

    // Let's construct this a different way
    // [] defines array;
    // _ infers the size of array from the number of items added
    // u8 stores each character using 8 bits of memory
    // This is a byte_array that ends up being: `[5]u8`
    const byte_array = [_]u8{
        'h',
        'e',
        'l',
        'l',
        'o',
    };

    // Create a single-item pointer to the byte_array
    // This results in *const [5]u8
    const barr_ptr = &byte_array;

    // Note that "barr_ptr"
    //   *const [5]u8
    // is not the same as "string_literal"
    //   *const [5:0]u8
    // This is because strings are *null terminated* byte arrays.
    // [_:0] defines array, infers size, adds 00000000 (null in 8 bits) at the end
    //
    // The following resolves to `[5:0]u8`
    //
    const null_term_byte_array = [_:0]u8{
        'h',
        'e',
        'l',
        'l',
        'o',
    };

    // Pointer to the null terminated byte array.
    // This gives us *const [5:0]u8
    // And this IS the same type as `string_literal`
    const null_barr_ptr = &null_term_byte_array;

    //
    // They all have the same value
    //
    try stdout.print("'string_literal' = {s}\n", .{string_literal});
    try stdout.print("'byte_array' = {s}\n", .{byte_array});
    try stdout.print("'barr_ptr.*' = {s}\n", .{barr_ptr.*});
    try stdout.print("'null_term_byte_array' = {s}\n", .{null_term_byte_array});
    try stdout.print("'null_barr_ptr.*' = {s}\n", .{null_barr_ptr.*});
    // 'string_literal' = hello
    // 'byte_array' = hello
    // 'barr_ptr.*' = hello
    // 'null_term_byte_array' = hello
    // 'null_barr_ptr.*' = hello

    //
    // But they are not the same thing
    //
    try stdout.print("'string_literal' = {}\n", .{@TypeOf(string_literal)});
    try stdout.print("'byte_array' = {}\n", .{@TypeOf(byte_array)});
    try stdout.print("'barr_ptr' = {}\n", .{@TypeOf(barr_ptr)});
    try stdout.print("'barr_ptr.*' = {}\n", .{@TypeOf(barr_ptr.*)});
    try stdout.print("'null_term_byte_array' = {}\n", .{@TypeOf(null_term_byte_array)});
    try stdout.print("'null_barr_ptr.*' = {}\n", .{@TypeOf(null_barr_ptr.*)});
    // 'string_literal' = *const [5:0]u8
    // 'byte_array' = [5]u8
    // 'barr_ptr' = *const [5]u8
    // 'barr_ptr.*' = [5]u8
    // 'null_term_byte_array' = [5:0]u8
    // 'null_barr_ptr.*' = [5:0]u8

    // Note: std.meta.eql does not follow pointers.
    // It compares the addresses themselves, not the bytes they point at.
    //
    // This is false:
    //
    if (std.meta.eql(&byte_array, string_literal)) {
        try stdout.print("true\n", .{});
    } else {
        try stdout.print("false\n", .{});
    }

    //
    // This is true:
    //
    if (std.meta.eql(&null_term_byte_array, string_literal)) {
        try stdout.print("true\n", .{});
    } else {
        try stdout.print("false\n", .{});
    }
}
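If the goal is to ask whether two strings hold the same bytes, regardless of type or address, std.mem.eql is the usual tool. A small sketch, assuming Zig 0.11 or 0.12 and zig test; sameBytes is a made-up wrapper, not a std function.

```zig
const std = @import("std");
const expect = std.testing.expect;

const string_literal = "hello"; // *const [5:0]u8
const byte_array = [_]u8{ 'h', 'e', 'l', 'l', 'o' }; // [5]u8

// Hypothetical helper: compare two strings by content.
fn sameBytes(a: []const u8, b: []const u8) bool {
    return std.mem.eql(u8, a, b);
}

test "different types, same bytes" {
    // The pointer types differ...
    try expect(@TypeOf(&byte_array) != @TypeOf(string_literal));
    // ...but both coerce to []const u8, and the contents match.
    try expect(sameBytes(&byte_array, string_literal));
    try expect(!sameBytes(&byte_array, "jello"));
}
```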
Who Cares?
I have a few hours each day to grind on whatever random projects I have on the go, which means this dive into “wtf a string, actually” is going on 5 days.
I’m not sure exactly how useful actually knowing what a *const [_:0]u8 is, other than some interesting trivia.
The good news is, the sentence “A string literal is a single-item pointer to a null terminated byte array” is now something I understand.
She already knew, and legit caught me mansplaining strings to her.
Rekt.
But joke’s on her, imo, because she paid like what, $600 per class to learn this to get her compsci bachelor’s?
I got here with free tier ChatGPT and a bad attitude.
Rekt indeed.