Unicode Explained: Convert Text to Unicode Code Points

Unicode is the international standard that aims to assign a unique code point to every character in every writing system, modern and historical. It covers more than 150,000 characters, including the Latin, Greek, Arabic, Hebrew, Chinese, Japanese, Korean, and Devanagari scripts, as well as emoji, mathematical symbols, musical notation, and ancient scripts. Without Unicode, the web would be fractured into incompatible national encoding systems.

What Is a Code Point?

A Unicode code point is a number, in the range U+0000 to U+10FFFF, assigned to a character. It is written as U+ followed by four to six hexadecimal digits. The letter A is U+0041. The copyright symbol is U+00A9. The grinning face emoji is U+1F600. The code point is the identity of the character; the encoding (UTF-8, UTF-16, UTF-32) determines how that code point is stored in bytes.
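As a small illustration of the U+ notation, the Python sketch below converts between a character and its code point label. The function names to_u_notation and from_u_notation are made up for this example and are not part of any library.

```python
def to_u_notation(ch: str) -> str:
    """Format one character as U+XXXX, padded to at least four hex digits."""
    return f"U+{ord(ch):04X}"


def from_u_notation(label: str) -> str:
    """Parse a label such as 'U+1F600' back into the character it names."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"expected a label starting with U+, got {label!r}")
    return chr(int(label[2:], 16))


for ch in ("A", "\u00A9", "\U0001F600"):
    print(ch, to_u_notation(ch))
```

The built-in ord and chr functions do the real work here; the helpers only add and remove the U+ prefix.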

UTF-8 and How It Works

UTF-8 is the dominant Unicode encoding on the web. It is a variable-length encoding: code points below 128 (U+0000 to U+007F) are stored as a single byte, identical to ASCII; code points up to U+07FF take 2 bytes; code points up to U+FFFF take 3 bytes; and everything above that, up to U+10FFFF, takes 4 bytes. This makes UTF-8 backward compatible with ASCII while supporting every Unicode character.
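The variable-length rule can be checked with a few lines of Python. This is a minimal sketch: utf8_length_from_code_point is a hypothetical helper written for this article, and it is compared against Python's built-in encoder. It does not handle the surrogate range U+D800 to U+DFFF, which has no UTF-8 encoding.

```python
def utf8_length_from_code_point(cp: int) -> int:
    """Predict how many bytes UTF-8 needs for a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if cp < 0x80:
        return 1  # ASCII range
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


# One sample from each length class: A, copyright sign, euro sign, grinning face.
for ch in ("A", "\u00A9", "\u20AC", "\U0001F600"):
    predicted = utf8_length_from_code_point(ord(ch))
    actual = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} predicted={predicted} actual={actual}")
```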

The emoji U+1F600 requires 4 bytes in UTF-8: F0 9F 98 80. This is why emoji sometimes cause bugs in systems that assume every character fits in one byte.
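To see where those four bytes come from, the sketch below packs a code point into the 4-byte UTF-8 layout by hand (a lead byte starting with 11110, then three continuation bytes starting with 10). The function encode_four_byte is illustrative only; in practice str.encode("utf-8") does this for you.

```python
def encode_four_byte(cp: int) -> bytes:
    """Encode a code point in U+10000..U+10FFFF as 4 UTF-8 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("this sketch only covers the 4-byte range")
    return bytes([
        0xF0 | (cp >> 18),           # lead byte: 11110xxx
        0x80 | ((cp >> 12) & 0x3F),  # continuation: 10xxxxxx
        0x80 | ((cp >> 6) & 0x3F),   # continuation: 10xxxxxx
        0x80 | (cp & 0x3F),          # continuation: 10xxxxxx
    ])


encoded = encode_four_byte(0x1F600)
print(" ".join(f"{b:02X}" for b in encoded))  # F0 9F 98 80
```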

Why Developers Need Unicode Tools

Common programming problems that require Unicode awareness include:

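Measuring string length correctly. The JavaScript length property of a string counts UTF-16 code units, not characters. An emoji that needs a surrogate pair has a length of 2 even though it is a single code point. Counting code points requires iterating the string by code point rather than by code unit, and counting what a reader sees as one character (a grapheme cluster, such as a flag or an emoji with a skin-tone modifier) requires grapheme segmentation on top of that.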

Normalisation. The letter e with an acute accent can be encoded as a single code point (U+00E9) or as the letter e followed by a combining accent mark (U+0065 + U+0301). These look identical but are different code point sequences and therefore different byte sequences. Comparing or sorting strings correctly requires normalising both to the same form first, such as NFC (composed) or NFD (decomposed).

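Escaping. In JSON and many programming languages, non-ASCII characters can be written as Unicode escape sequences such as \u00E9. JSON's \u form takes exactly four hexadecimal digits, so a character above U+FFFF is written as a UTF-16 surrogate pair, for example \uD83D\uDE00 for U+1F600. Knowing the code point lets you write the escape correctly.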
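The three problems above can be reproduced in a few lines. The sketch below uses Python rather than JavaScript, so the UTF-16 length is computed by encoding the string; utf16_length is a helper invented for this example. Counting grapheme clusters needs a segmentation library and is not shown.

```python
import json
import unicodedata


def utf16_length(text: str) -> int:
    """Count UTF-16 code units, which is what JavaScript's length reports."""
    return len(text.encode("utf-16-le")) // 2


# 1. Length: one code point, but two UTF-16 code units.
emoji = "\U0001F600"
print(len(emoji), utf16_length(emoji))  # 1 2

# 2. Normalisation: the two spellings differ until both are normalised.
composed = "\u00E9"      # single code point
decomposed = "e\u0301"   # e followed by a combining acute accent
print(composed == decomposed)  # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# 3. Escaping: json.dumps escapes non-ASCII characters by default and
#    writes characters above U+FFFF as a surrogate pair.
print(json.dumps(composed))  # "\u00e9"
print(json.dumps(emoji))     # "\ud83d\ude00"
```

Python's len already counts code points, which is why the emoji has a length of 1 there and 2 in JavaScript.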

How to Use the DevHexLab Unicode Converter

Open the tool at /tools/encoding/unicode-converter. Paste any text. The tool shows each character with its code point in U+ notation and its UTF-8 byte sequence. You can also paste Unicode escape sequences to decode them back to readable characters.
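If you want to reproduce this kind of output in a script, the Python sketch below lists each character with its code point and UTF-8 bytes, and decodes \u escapes. It is not the tool's implementation: describe and decode_escapes are hypothetical names, and decode_escapes assumes the input follows JSON string syntax (it will fail on unescaped quotes or backslashes).

```python
import json


def describe(text: str) -> list:
    """Return (character, code point label, UTF-8 bytes in hex) for each code point."""
    rows = []
    for ch in text:
        label = f"U+{ord(ch):04X}"
        utf8_hex = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
        rows.append((ch, label, utf8_hex))
    return rows


def decode_escapes(escaped: str) -> str:
    """Decode JSON-style \\uXXXX escapes, including surrogate pairs."""
    return json.loads(f'"{escaped}"')


for row in describe("A\u00A9\U0001F600"):
    print(*row)

print(decode_escapes("caf\\u00e9"))
```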

Frequently Asked Questions

What is the difference between Unicode and UTF-8?

Unicode is the standard that defines code points. UTF-8 is an encoding that specifies how code points are stored as bytes. UTF-16 and UTF-32 are other encodings of the same Unicode code points.
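A quick way to see the difference is to encode one code point three ways. The Python sketch below uses the big-endian variants of UTF-16 and UTF-32 so that no byte order mark is added; hex_bytes is a small helper written for this example.

```python
def hex_bytes(data: bytes) -> str:
    """Render bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in data)


ch = "\U0001F600"  # one code point, U+1F600
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    print(encoding, hex_bytes(ch.encode(encoding)))
# utf-8     F0 9F 98 80
# utf-16-be D8 3D DE 00
# utf-32-be 00 01 F6 00
```

The code point is the same in all three cases; only the byte representation changes.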

Why do some emoji appear as two characters?

Emoji above code point U+FFFF require a surrogate pair in UTF-16, so each one has a string length of 2 in JavaScript. Each is still a single Unicode code point and one visible glyph.
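The surrogate pair itself is simple arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The Python sketch below shows the calculation in both directions; the function names are invented for this example.

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```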

Understanding Unicode removes one of the most mysterious categories of bugs in text-processing code.