One smile, three answers
Type "😀" into a Twitter compose box and the counter goes up by one. Type the same character into a JavaScript console and '😀'.length is 2. Send it to a Postgres column and four bytes hit the disk. None of these is wrong. They're answering different questions about the same string — and the gap between those answers is where every painful string bug lives.
Code units, not characters
A string isn't a row of "letters." It's a row of integers, and an encoding rule decides what those integers mean. JavaScript and Java pick UTF-16: each integer is 16 bits wide, called a code unit. The string "hi" is the row [0x0068, 0x0069] — two code units, two boxes, same shape as the array from the previous lesson.
function get(s, i) {
// 16-bit code unit at byte offset i × 2.
return s.charAt(i);
}
For ASCII text — "hi", "router-config-v3", anything you'd type without holding the option key — one code unit equals one visible character, and the array model carries over cleanly. s.length is the count, s[i] is the character, life is simple.
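A minimal sketch of that one-to-one case in plain Node (no libraries assumed): for ASCII, every question you can ask the string agrees with what you see on screen.

```javascript
// ASCII: one code unit per visible character, so the array model holds.
const s = "hi";
console.log(s.length);                     // 2 code units, 2 characters
console.log(s.charAt(0));                  // "h"
console.log(s.charCodeAt(0).toString(16)); // "68", the integer in the first cell
```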
The visualiser above shows that simple case first: a UTF-16 row for "hi", with each cell holding a hex code unit and a rendered text strip below it that interprets the integers. Read s[0]. Same shape as arr[0]. Then watch what happens when the smile lands.
What "modify" really means
Strings are immutable. Unlike the array from the previous lesson, s[2] = 'x' is not a thing — there is no in-place write. That's why every row in the visualiser shows up with a lock icon the moment it's committed. Concatenation, slicing, replacement — all of them allocate a new row and copy.
function concat(a, b) {
// Strings are immutable: a and b are unchanged.
return a + b; // allocates a new row
}
Watch what happens when the visualiser evaluates "hi" + "!". The original "hi" row stays exactly where it was — locked, unchanged, still readable from any other reference. A new row is allocated below, three cells wide, with the integers from "hi" and "!" copied in. Nothing was edited. Something new was built.
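A quick sketch of that same evaluation in plain JavaScript: nothing here is edited in place.

```javascript
// Concatenation builds a new string; the originals are untouched.
const a = "hi";
const b = "!";
const c = a + b; // new 3-cell row: copies of a's units, then b's
console.log(a);  // "hi" (still intact)
console.log(c);  // "hi!"
```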
The visualiser shows two locked rows: row a = [0x0068, 0x0069] (rendered "hi") and row b = [0x0021] (rendered "!"). We're about to evaluate a + b. Which picture matches the state immediately after the expression evaluates?
This is the part that turns innocent-looking code into performance traps. If concatenation always copies, then concatenation in a loop copies the same prefix over and over.
The Schlemiel-the-Painter trap
Here is the canonical bad version. It looks fine. It works. It will also melt your laptop on a long input.
let result = "";
for (let i = 0; i < parts.length; i++) {
result = result + parts[i]; // each += allocates
}
Each iteration takes the current result, allocates a new row one cell longer, and copies everything that was already there plus the new piece. After 4 iterations, the total work isn't 4 — it's 1 + 2 + 3 + 4 = 10. After n iterations, it's n(n+1)/2. Quadratic.
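That arithmetic is easy to check mechanically. This sketch doesn't concatenate anything; it just tallies how many code units the quadratic pattern would copy for n single-unit parts (the function name is ours, purely illustrative):

```javascript
// Tally code units copied by `result = result + parts[i]` over n iterations.
function copiesForQuadraticAppend(n) {
  let copied = 0;
  let len = 0; // current length of result
  for (let i = 0; i < n; i++) {
    len += 1;      // result grows by one unit
    copied += len; // the whole new result is written out
  }
  return copied;   // n * (n + 1) / 2
}
console.log(copiesForQuadraticAppend(4));    // 10
console.log(copiesForQuadraticAppend(1000)); // 500500
```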
A loop runs result = result + parts[i] for i = 0, 1, 2, 3 starting from result = "". Each += allocates a new row and copies everything accumulated so far plus the new code unit. How many code units have been copied in total after all 4 iterations?
The fix is the same one Java's StringBuilder and Go's strings.Builder were invented for: collect into a buffer, allocate once at the end.
const buf = [];
for (let i = 0; i < parts.length; i++) {
buf.push(parts[i]); // O(1) push
}
const result = buf.join(""); // single allocation
Same answer, linear cost. The visualiser shows both side-by-side: red counter for the quadratic version that visibly accelerates, green counter for the linear version that ticks once at the end. The shapes don't lie.
You want to build a single string from n substrings without quadratic cost. Drag these four steps into the right order.
- 1. Allocate a fresh array buf = []
- 2. Iterate from i = 0 to i = n - 1
- 3. Inside the loop, do buf.push(parts[i])
- 4. After the loop, do result = buf.join("")
V8 and other modern engines do soften the quadratic case via internal "rope" structures that defer the copy until the string is actually read. Don't rely on this. The optimisation is fragile; the moment something logs the partial result or compares it, the rope flattens and you pay the bill.
When length stops meaning "characters"
Now the smile. Append "😀" to "hi!" and the new row is five cells wide, not four. The smile is a code point above U+FFFF — outside what 16 bits can represent — so UTF-16 stores it as a surrogate pair: two code units that, taken together, encode one character. The visualiser draws a thin bracket above those two cells with a "1 code point" label, but s.length doesn't see the bracket. It just counts cells.
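The pair isn't arbitrary; it's arithmetic. Here's a sketch of the UTF-16 encoding rule (the constants come from the Unicode standard; the function name is ours):

```javascript
// Encode a code point above U+FFFF as a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const v = codePoint - 0x10000;   // 20 bits left to distribute
  const high = 0xd800 + (v >> 10); // top 10 bits ride the high surrogate
  const low = 0xdc00 + (v & 0x3ff); // bottom 10 bits ride the low surrogate
  return [high, low];
}
const [high, low] = toSurrogatePair(0x1f600); // the smile
console.log(high.toString(16), low.toString(16)); // "d83d" "de00"
```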
In JavaScript, what does '😀'.length evaluate to?
If you got that one wrong, you're in the same place as every team that has ever shipped a "280-character tweet" feature. JavaScript's length answers a question — how many UTF-16 code units? — that almost never matches the question your product wants answered: how many characters does a human see? Twitter computes its own count. So does every input-validation library worth using.
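If the question really is "what does a human see?", the built-in Intl.Segmenter (Node 16+ and modern browsers) answers it by grapheme cluster; a minimal sketch:

```javascript
// Count grapheme clusters: what a human perceives as characters.
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...seg.segment("hi!😀")];
console.log("hi!😀".length);   // 5 code units
console.log(graphemes.length); // 4 graphemes
```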
Indexing returns half a smile
Here's the real trap. If you index into a surrogate pair, you don't get the character. You get half of it.
The current row is [0x0068, 0x0069, 0x0021, 0xD83D, 0xDE00], rendered "hi!😀". We index s[3]. What does the returned 1-element string render as?
The cell at index 3 holds 0xD83D — a high surrogate. On its own, it isn't a valid Unicode character. Renderers display it as �, JSON encoders complain about lone surrogates, regex engines refuse to match it. The lesson isn't "indexing is broken." The lesson is that s[i] answers a question about code units, not a question about characters. To iterate by character, use for..of or spread [...s] — those yield code points, not code units.
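The two iteration modes side by side, same string, different questions:

```javascript
const t = "hi!😀";
// By code unit: indexes 3 and 4 are the two surrogate halves.
console.log(t.length);                     // 5
console.log(t.charCodeAt(3).toString(16)); // "d83d" (half a smile)
// By code point: spread and for..of reassemble the pair.
console.log([...t].length);                // 4
console.log(t.codePointAt(3).toString(16)); // "1f600" (the whole smile)
```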
The same trap shows up when slicing.
This function tries to return the first half of any string. For most inputs it works. For one of these inputs, it returns a string with a broken character. Which input?
function firstHalf(s) { return s.slice(0, s.length / 2); }
slice(0, length / 2) is character-safe for ASCII strings and mostly safe for European text (which fits in BMP code points), but it can bisect a surrogate pair on any input that has emoji, mathematical symbols, or non-BMP CJK extensions. In a search-suggestions feature, that's a � showing up in the dropdown. In a logs pipeline, it's a downstream parser crashing on an invalid UTF-8 sequence after a re-encode.
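One defensive sketch (a hypothetical helper, not a library function): before returning, check whether the cut lands right after a high surrogate and back off by one unit. It only guards the end boundary, which is enough for a prefix slice.

```javascript
// Slice a prefix without stranding half a surrogate pair at the cut.
function safePrefix(s, end) {
  const last = s.charCodeAt(end - 1);
  if (last >= 0xd800 && last <= 0xdbff) {
    end -= 1; // would split a pair: retreat one code unit
  }
  return s.slice(0, end);
}
console.log(safePrefix("hi!😀", 4)); // "hi!" rather than "hi!" + a lone 0xD83D
```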
Where the trade-offs sit
A raw string is brilliant when the workload is:
- A few values to read or compare.
- Concatenations done once, not in a loop.
- Text where one code unit equals one character (ASCII, most European text in NFC form).
It's the wrong primitive when:
- You're building output incrementally in a loop. Use an array buffer and .join(""), or stream directly to wherever the bytes need to go.
- You need to count characters as a human sees them. Use a grapheme-cluster library; do not trust length.
- You're slicing arbitrary user-supplied text. Either iterate by code point, or accept that you might cut a surrogate pair and add a normalisation step.
- You need locale-aware comparison or case folding. === and toLowerCase() are not what you want; use localeCompare and toLocaleLowerCase with an explicit locale, and never use case-insensitive equality for security checks.
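A quick sketch of why the explicit locale matters (the results assume a Node build with full ICU data, the default since Node 13):

```javascript
// Same two strings, different collation rules per locale.
console.log("ä".localeCompare("z", "de") < 0); // true: German sorts ä near a, before z
console.log("ä".localeCompare("z", "sv") > 0); // true: Swedish puts ä after z
// Case folding is locale-sensitive too: Turkish dotless i.
console.log("I".toLocaleLowerCase("tr")); // "ı"
console.log("I".toLocaleLowerCase("en")); // "i"
```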
For which workload is "treat strings as immutable arrays of code units, build via += in a loop" the worst fit?
Where this shows up in real systems
A few places the integer-array view of strings turns from interesting to load-bearing:
- JSON parsers: parsing performance is dominated by string handling β finding quote boundaries, validating UTF-8, unescaping. simdjson hits gigabytes per second by treating JSON's strings as raw byte arrays and using SIMD to scan them. Source: simdjson.org.
- V8 string representation: V8 internally has at least four different string layouts — SeqOneByteString (Latin-1 packed, half the memory), SeqTwoByteString (UTF-16), ConsString (concatenation tree, lazy), and SlicedString (a view into a parent string). Calling console.log(s) on a deeply concatenated string sometimes triggers a "flatten" that walks the cons tree and allocates a fresh contiguous buffer.
- Java compact strings (JEP 254): since JDK 9, Java stores strings as a byte[] plus a 1-byte coder field. Pure Latin-1 strings take half the memory of pre-9 UTF-16 strings. Most real Java apps got a measurable heap reduction the day they upgraded.
- Postgres TEXT / VARCHAR: stored inline up to ~2 KB, then transparently TOASTed (compressed and moved out-of-line) for larger values. The encoding is a database-level setting; mismatched encodings between client and server are the classic cause of ? corruption that survives every "fix" except actually agreeing on UTF-8.
- DNS labels: each label in a domain name is at most 63 bytes, total wire-form name at most 255. Non-ASCII domains go through Punycode — münchen.de becomes xn--mnchen-3ya.de on the wire. The whole protocol assumes a fixed, byte-counted encoding.
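You can watch that Punycode step happen in Node's WHATWG URL parser, which applies the same to-ASCII conversion browsers do (assuming a full-ICU Node build):

```javascript
// Non-ASCII hostnames are converted to Punycode wire form at parse time.
const u = new URL("https://münchen.de/path");
console.log(u.hostname); // "xn--mnchen-3ya.de"
```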
The three answers from the start — Twitter says one, JavaScript says two, Postgres says four — were each correct under their own encoding rule. Once you see strings as integer arrays with a rule on top, the disagreement stops feeling mysterious and starts feeling like exactly what it is: three different rules, three different counts, no single right answer.
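The closing claim is checkable in one Node session, three rules applied to the same two code units (Buffer is Node-specific):

```javascript
const smile = "😀";
console.log(smile.length);                     // 2: UTF-16 code units (JavaScript's rule)
console.log([...smile].length);                // 1: code points, nearest to "one character"
console.log(Buffer.byteLength(smile, "utf8")); // 4: UTF-8 bytes (what Postgres stores)
```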
What to learn next
- Hash maps — when you need O(1) lookup by a string key, and the string's hashCode() becomes the operation that matters.
- Two pointers — the first pattern that exploits left-to-right traversal of a string for problem solving.
- KMP and Rabin-Karp — substring search that beats the naïve O(n × m) we glossed past.