WIP: [GSoC] Add initial libunicode parser and example #356

Draft
codewithchill wants to merge 5 commits from codewithchill/kolibrios:libutf into main
First-time contributor

Hi Ivan and team,

This PR introduces the initial setup for libunicode as part of my GSoC qualification task.

What is included:

  • /programs/develop/libraries/libunicode/libunicode.asm: Contains the core parsing logic.
    • count_utf8_codepoints: Counts raw Unicode values.
    • count_utf8_graphemes: Counts visual characters (includes logic to subtract counts for Zero-Width Joiners and basic combining marks).
  • /programs/develop/libraries/libunicode/examples/console.asm: A test application that passes a complex UTF-8 string to the functions and prints the results to the console.

I have tested this against standard ASCII, Russian text, complex ZWJ emojis, and accented characters.

I am looking forward to your feedback on KolibriOS code style, formatting, and best practices so I can update this to match the official OS standards!

dunkaist reviewed 2026-03-09 14:16:47 +00:00
@@ -0,0 +1,81 @@
;=============================================================
Owner

The current libunicode.asm is a set of functions. To use them, a user has to include the file in each program, so the code of libunicode.asm is duplicated in every program that uses it. Let's turn libunicode.asm into a dynamically loadable library like, e.g., [libcrash](https://git.kolibrios.org/KolibriOS/kolibrios/src/branch/main/programs/develop/libraries/libcrash/libcrash.asm). Then you can put libunicode.obj in /sys/lib/ and use it just as you are using console.obj.
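For orientation, a loadable library of this kind might be laid out roughly as follows. This is only a sketch modelled on the general shape of other KolibriOS libraries; the exported name `utf8.count_codepoints`, its placeholder body, and the relative include paths are assumptions, not the PR's actual code.

```asm
; Sketch only: a COFF object with an export table published as 'EXPORTS'.
format MS COFF

public @EXPORT as 'EXPORTS'

include '../../../proc32.inc'           ; assumed relative path to /programs
include '../../../macros.inc'           ; assumed to provide the 'export' macro

section '.flat' code readable align 16

; Hypothetical exported function with a placeholder body.
; The real counting logic would replace the two instructions below.
proc utf8.count_codepoints uses esi, src:dword
        mov     esi, [src]              ; pointer to the UTF-8 string
        xor     eax, eax                ; placeholder result
        ret
endp

align 4
@EXPORT:
export \
        utf8.count_codepoints, 'utf8.count_codepoints'
```

A program would then list the resulting .obj file in its import table next to console.obj and call the function through the imported pointer.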
Author
First-time contributor

I will do it later, once I am more familiar with the file format and dynamic libraries.
Owner

Ping me this weekend if you can't convert this code to a dynamic library. We will take a look.
Author
First-time contributor

I think the current commit converts it into a library, as you requested.
dunkaist marked this conversation as resolved
dunkaist reviewed 2026-03-09 14:23:51 +00:00
@@ -0,0 +6,4 @@
; ecx <- each byte
;
;=============================================================
count_utf8_codepoints:
Owner

This applies to this and all the other public functions of the library: please make them [stdcall](https://en.wikipedia.org/wiki/X86_calling_conventions#stdcall). In particular, count_utf8_codepoints destroys ebx. Strictly speaking, exported functions in KolibriOS are not required to follow the stdcall convention. Still, it is good practice.
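As an illustration of what the convention asks for, here is a hand-written stdcall version of a code point counter. It is a sketch, not the PR's code: the label name is hypothetical, and the loop simply counts every byte that is not a continuation byte (10xxxxxx) without validating the input.

```asm
; stdcall sketch: the argument is passed on the stack, the callee removes it,
; ebx/esi/edi/ebp are preserved, and the result is returned in eax.
count_codepoints_sketch:                ; hypothetical name
        push    ebx
        push    esi
        mov     esi, [esp+12]           ; argument: 2 saved regs + return address
        xor     ebx, ebx                ; ebx = code point counter
  .next:
        mov     al, [esi]
        inc     esi
        test    al, al
        jz      .done                   ; zero terminator reached
        and     al, 0xC0
        cmp     al, 0x80
        je      .next                   ; continuation byte: do not count
        inc     ebx
        jmp     .next
  .done:
        mov     eax, ebx                ; return value
        pop     esi
        pop     ebx                     ; ebx is restored for the caller
        ret     4                       ; callee pops its single dword argument
```

The `proc`/`endp` macros from proc32.inc can generate the same prologue, register saving, and `ret 4` automatically, which is usually less error-prone than computing stack offsets by hand.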
Author
First-time contributor

ok, noted, I will do it.

Author
First-time contributor

Strictly speaking, this is still left. I will do it.

dunkaist reviewed 2026-03-09 14:28:32 +00:00
@@ -0,0 +75,4 @@
inc eax
jmp read_loop_graph
done_graph:
Owner

It is convenient to use [local labels](https://flatassembler.net/docs.php?article=manual#1.2.3) inside functions. For example, here you had to invent the name 'done_graph' because plain 'done' was already used in the previous function. With local labels, every function can have its own '.done'.
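A small illustration of the idea, with hypothetical function names and placeholder bodies: in fasm a label that starts with a dot is attached to the most recent global label, so the same local name can be reused in each function.

```asm
; Illustrative only: the function bodies are placeholders.
first_func:
        xor     eax, eax
  .loop:
        inc     eax
        cmp     eax, 10
        jb      .loop
  .done:                                ; full name: first_func.done
        ret

second_func:
        mov     eax, 1
  .done:                                ; full name: second_func.done, no clash
        ret
```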
Author
First-time contributor

I think this commit addresses it. I appended the change to the current commit.
Owner

Looks good, well done

codewithchill marked this conversation as resolved
dunkaist reviewed 2026-03-09 14:31:19 +00:00
@@ -0,0 +24,4 @@
inc eax
jmp read_loop
done:
mov eax, ebx
Owner

Sometimes you put just one space between a mnemonic and a register, sometimes more. Please read our code style [here](https://board.kolibrios.org/viewtopic.php?t=1950) and follow it.
codewithchill marked this conversation as resolved
dunkaist reviewed 2026-03-09 14:40:46 +00:00
@@ -0,0 +35,4 @@
; ecx <- each byte
;
;=============================================================
count_utf8_gramphene:
Owner

It is good practice for a library to use a common prefix for its exported symbols. For example, [libini](https://git.kolibrios.org/KolibriOS/kolibrios/src/branch/main/programs/develop/libraries/libs-dev/libini/libini.asm#L651) names its functions ini.get_str, etc. I believe a unicode library could name count_utf8_gramphene something like utf8.count_graphemes, or whatever similar name you prefer.
Author
First-time contributor

Internally, I am keeping the names the same, but I am exporting them under dotted (.) names. Is this OK, or should I change it?
Owner

Some people like such prefixes, some don't. Not a requirement.

Author
First-time contributor

I think I will keep the internal function names as they are and, when exporting, prefix each function name with utf.
dunkaist reviewed 2026-03-09 14:54:08 +00:00
@@ -0,0 +24,4 @@
push -1
push -1
push -1
call [con_init]
Owner

I believe you can use the 'invoke' macro here and in similar cases.
Author
First-time contributor

Could you point me to the documentation for this invoke macro? That would be helpful. Is it part of KolibriOS or of FASM?
Owner

invoke is one of the standard fasm [macros](https://git.kolibrios.org/KolibriOS/kolibrios/src/branch/main/programs/proc32.inc#L13); we use it too. There is no documentation for those 8 lines, but there are examples. Look at, e.g., the [crashtest](https://git.kolibrios.org/KolibriOS/kolibrios/src/commit/289eabf8a4e2f4b2258cdc3a2e6a36a965cda472/programs/develop/libraries/libcrash/crashtest.asm#L47) program:

invoke crash.hash, LIBCRASH_SHA2_256, read_data, 0, bin

This is just a shorter version of the following:

push bin
push 0
push read_data
push LIBCRASH_SHA2_256
call [crash.hash]

Well, maybe using macros is not very intuitive while you are still getting familiar with fasm. Feel free to use the long explicit form for now, no pressure at all.
codewithchill marked this conversation as resolved
codewithchill marked the pull request as work in progress 2026-03-11 10:00:22 +00:00
mxlgv added the Category/Libraries, FASM, GSoC, Kind/Feature, Priority/Medium labels 2026-03-11 10:27:26 +00:00
dunkaist requested changes 2026-03-18 03:12:01 +00:00
@@ -0,0 +4,4 @@
db 'MENUET01'
dd 0x01, START, I_END
dd 0x100000 ; 1MB Memory
Owner

Just guessing 1 MiB will work. However, it is better to specify exact numbers. Check a few fasm programs to see how these two fields (required memory and initial stack pointer) can be specified better.
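One pattern commonly seen in fasm programs for KolibriOS is to let labels at the end of the file define these fields, so the numbers follow the real size of the program. The sketch below assumes the standard MENUET01 header layout; the label names `STACKTOP` and `MEM` and the 4096-byte stack size are illustrative choices, not taken from the PR.

```asm
use32
        org     0
        db      'MENUET01'              ; signature
        dd      1                       ; header version
        dd      START                   ; entry point
        dd      I_END                   ; size of the image loaded from disk
        dd      MEM                     ; total memory the program needs
        dd      STACKTOP                ; initial value of esp
        dd      0                       ; command-line buffer (not used here)
        dd      0                       ; program path buffer (not used here)

START:
        ; ... program code and initialized data ...

I_END:                                  ; end of the data stored in the file
        rb      4096                    ; uninitialized stack space (assumed size)
align 16
STACKTOP:                               ; the stack grows down from here
MEM:                                    ; everything below this address is allocated
```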
Author
First-time contributor

I think the appended commit fixes this as well.
@@ -0,0 +9,4 @@
dd 0x0
dd 0x0
include '../proc32.inc'
Owner

Please use proc32.inc, macros.inc, and struct.inc from the /programs directory. Include dll.inc below in the same way.
codewithchill marked this conversation as resolved
@@ -0,0 +15,4 @@
START:
stdcall dll.Load, import_table
test eax, eax
Owner

Here and in all other places. Please, check and follow the code style. Originally, it was written for the kernel, but it is a good practice to follow it outside the kernel too. https://board.kolibrios.org/viewtopic.php?t=1950

Author
First-time contributor

I think this commit fixes that as well. Do you use a specific linter or FASM code formatter script to check these 8/16-column alignments locally, or do you just have 'kernel developer vision' from staring at the codebase all day? 😅
@@ -0,0 +76,4 @@
align 4
import_table:
library \
console, '/sys/lib/console.obj', \
Owner

'/sys/lib/' is the default path for libraries. You can write just 'console.obj' when the library is in the default location.
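With that change the import table could look roughly like this. It is a sketch that assumes the `library` and `import` macros from /programs/macros.inc and shows only a few console.obj exports; the PR's real table may list other functions and libraries.

```asm
align 4
import_table:
library \
        console, 'console.obj'          ; looked up in /sys/lib/ by default

import  console, \
        con_init,   'con_init', \
        con_printf, 'con_printf', \
        con_exit,   'con_exit'
```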
codewithchill marked this conversation as resolved
dunkaist requested changes 2026-03-23 19:42:38 +00:00
@@ -0,0 +1,615 @@
@^ fix macro comment {
Owner

This and other *.inc files should be removed from the PR, use these files from /programs instead

codewithchill marked this conversation as resolved
codewithchill force-pushed libutf from fc40aa8f9b to 242469b1f3 2026-03-24 05:35:38 +00:00
codewithchill added 5 commits 2026-04-25 18:31:37 +00:00
- Added libunicode.asm to parse UTF-8 strings.
- Implemented count_utf8_codepoints to skip continuation bytes.
- Implemented count_utf8_graphemes to handle ZWJ (E2 80 8D) and combining marks (CC/CD).
- Added console.asm to the examples folder to test and print the results.
- Submitted for GSoC qualification task.
Edit the example file to reflect the changes
1. Use cinvoke macro instead of invoke for con_printf
2. Convert `end:` to `P_END:`, as the former is a FASM keyword
3. Upgrade all functions in libunicode.asm to use stdcall convention
Implemented the `is_valid_utf8_char` procedure to safely validate UTF-8
sequences and return their byte length (1-4, or 0 if invalid).

This routine implements strict Unicode compliance checks, including:
- Rejection of overlong encodings (e.g., checking 0xC0/0xC1, and strict
bounds for 0xE0/0xF0).
- Prevention of surrogate half decoding (restricting 0xED bounds).
- Enforcement of the maximum Unicode scalar value limit (U+10FFFF).
- Safe handling of null-terminators and truncated sequences.

This provides a secure foundation for upgrading the codepoint and
grapheme counting functions in upcoming commits.
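
The checks listed in this commit message could be sketched as follows. This is an illustrative reconstruction, not the code from the commit: the name `utf8_seq_len_sketch` is hypothetical, it assumes proc32.inc is included, and returning 0 for a NUL byte is an assumption about how the terminator is reported.

```asm
; Returns eax = length of the UTF-8 sequence at [src] (1..4), or 0 if invalid.
proc utf8_seq_len_sketch uses ebx esi, src:dword
        mov     esi, [src]
        movzx   ebx, byte [esi]         ; bl = lead byte
        xor     eax, eax
        test    bl, bl
        jz      .done                   ; NUL terminator: report 0 (assumption)
        inc     eax                     ; eax = 1
        cmp     bl, 0x80
        jb      .done                   ; plain ASCII
        cmp     bl, 0xC2
        jb      .invalid                ; stray continuation or overlong C0/C1
        cmp     bl, 0xF4
        ja      .invalid                ; lead byte beyond U+10FFFF
        mov     dl, 0x80                ; dl..dh = allowed range of byte 2
        mov     dh, 0xBF
        inc     eax                     ; eax = 2
        cmp     bl, 0xE0
        jb      .second
        inc     eax                     ; eax = 3
        cmp     bl, 0xF0
        jb      .three
        inc     eax                     ; eax = 4
        cmp     bl, 0xF0
        jne     @f
        mov     dl, 0x90                ; F0: reject overlong encodings
    @@:
        cmp     bl, 0xF4
        jne     .second
        mov     dh, 0x8F                ; F4: stay at or below U+10FFFF
        jmp     .second
  .three:
        cmp     bl, 0xE0
        jne     @f
        mov     dl, 0xA0                ; E0: reject overlong encodings
    @@:
        cmp     bl, 0xED
        jne     .second
        mov     dh, 0x9F                ; ED: reject surrogate halves
  .second:
        mov     bl, [esi+1]             ; a NUL here fails the range check
        cmp     bl, dl
        jb      .invalid
        cmp     bl, dh
        ja      .invalid
        mov     ecx, 2                  ; index of the next byte to check
  .tail:
        cmp     ecx, eax
        jae     .done
        mov     bl, [esi+ecx]
        and     bl, 0xC0
        cmp     bl, 0x80                ; must be 10xxxxxx
        jne     .invalid
        inc     ecx
        jmp     .tail
  .invalid:
        xor     eax, eax
  .done:
        ret
endp
```

Each byte is read only after the previous one has been checked to be non-zero, so a truncated sequence is rejected without reading past the string terminator.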
Remove macros.inc, struct.inc, macros.inc files
Build system / Check kernel codestyle (pull_request) Successful in 30s
Build system / Build (pull_request) Successful in 10m14s
e36f836a8b
These files were removed as they can be directly accessed from /programs
codewithchill force-pushed libutf from 242469b1f3 to e36f836a8b 2026-04-25 18:31:37 +00:00
All checks were successful
This pull request is marked as a work in progress.
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u http://git.kolibrios.org/codewithchill/kolibrios libutf:codewithchill-libutf
git checkout codewithchill-libutf
Reference: KolibriOS/kolibrios#356