WIP: [GSoC] Add initial libunicode parser and example #356
Draft
codewithchill wants to merge 5 commits from codewithchill/kolibrios:libutf into KolibriOS:main
Reference: KolibriOS/kolibrios#356
Hi Ivan and team,
This PR introduces the initial setup for libunicode as part of my GSoC qualification task.

What is included:
- /programs/develop/libraries/libunicode/libunicode.asm: Contains the core parsing logic.
  - count_utf8_codepoints: Counts raw Unicode values.
  - count_utf8_graphemes: Counts visual characters (includes logic to subtract counts for Zero-Width Joiners and basic combining marks).
- /programs/develop/libraries/libunicode/examples/console.asm: A test application that passes a complex UTF-8 string to the functions and prints the results to the console.

I have tested this against standard ASCII, Russian text, complex ZWJ emojis, and accented characters.
I am looking forward to your feedback on KolibriOS code style, formatting, and best practices so I can update this to match the official OS standards!
@@ -0,0 +1,81 @@
;=============================================================

The current libunicode.asm is a set of functions. To use them, a user has to include the file into each program. This means the code of libunicode.asm is duplicated for each program using it. Let's turn libunicode.asm into a dynamically loadable library like e.g. libcrash. Then you can put libunicode.obj to /sys/lib/ and use it just like you are using console.obj.
I will do it later, once I am more used to the file type and dynamic libraries.
Ping me during this weekend if you can't convert this code to a dynamic library. We will take a look.
I think the current commit converts it into a library as per your requirements.
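For readers following the thread, the conversion requested above can be sketched roughly as follows. This is a minimal sketch, not the PR's actual code. It assumes the MS COFF layout and the `export` macro from /programs/macros.inc that existing KolibriOS libraries appear to use. The symbol `utf8_count_codepoints` and its export string are hypothetical, and details such as include order may differ in the real tree.

```asm
; libunicode.asm as a loadable library: a sketch under the assumptions above.
format MS COFF
public @EXPORT as 'EXPORTS'

section '.flat' code readable align 16

include '../../../proc32.inc'           ; /programs/proc32.inc (proc/endp macros)
include '../../../macros.inc'           ; assumed to provide the 'export' macro

; Counts codepoints by counting every byte that is not a UTF-8
; continuation byte (10xxxxxx). Hypothetical name and argument.
proc utf8_count_codepoints uses esi, _str
        mov     esi, [_str]             ; pointer to a zero-terminated string
        xor     eax, eax                ; result
  .next:
        movzx   ecx, byte [esi]
        inc     esi
        test    ecx, ecx
        jz      .done
        and     ecx, 0xC0
        cmp     ecx, 0x80
        je      .next                   ; continuation byte, not a new codepoint
        inc     eax
        jmp     .next
  .done:
        ret
endp

align 4
@EXPORT:
export \
        utf8_count_codepoints, 'utf8_count_codepoints'
```

The resulting libunicode.obj would then be placed in /sys/lib/ and loaded by programs at run time instead of being assembled into each of them.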
@@ -0,0 +6,4 @@
; ecx <- each byte
;
;=============================================================
count_utf8_codepoints:

This and all the other public functions of the library: please, make them stdcall. In particular, count_utf8_codepoints destroys ebx. Strictly speaking, it is not mandatory for exported functions in KolibriOS to follow stdcall convention. Still, it is a good practice.
OK, noted, I will do it.
Strictly speaking, this is still pending. I will do it.
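To make the stdcall request concrete: under stdcall the callee may clobber eax, ecx, and edx, but must preserve ebx, esi, edi, and ebp, and it removes its own arguments from the stack. The sketch below keeps ebx as the counter, as the PR seems to do, and saves it explicitly. It assumes the string pointer is passed as a single stack argument, which the PR may not do, and it uses local labels (`.read_loop`, `.done`).

```asm
; Sketch only: stdcall-compatible codepoint counter with explicit saves.
; Assumption: one stack argument, a pointer to a zero-terminated UTF-8 string.
count_utf8_codepoints:
        push    ebx                     ; callee-saved under stdcall
        push    esi                     ; callee-saved under stdcall
        mov     esi, [esp + 12]         ; skip saved esi, saved ebx, return address
        xor     ebx, ebx                ; codepoint counter
  .read_loop:
        lodsb                           ; al = next byte, esi advances
        test    al, al
        jz      .done
        and     al, 0xC0
        cmp     al, 0x80
        je      .read_loop              ; 10xxxxxx: continuation byte, skip it
        inc     ebx
        jmp     .read_loop
  .done:
        mov     eax, ebx                ; result is returned in eax
        pop     esi
        pop     ebx
        ret     4                       ; callee pops its one argument
```

With the `proc ... uses ebx esi, arg` form from proc32.inc, the saves, restores, and `ret 4` are generated by the macro instead of being written by hand.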
@@ -0,0 +75,4 @@
inc eax
jmp read_loop_graph
done_graph:

It is convenient to use local labels inside functions. For example, here you had to invent the name 'done_graph' because just 'done' was used in the previous function. You can use a local label '.done' for every function.
I think this commit should work. I just appended it to the current commit.
Looks good, well done
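As an illustration of the local-label advice: in fasm a label starting with a dot is attached to the most recent global label, so the same short names can be reused in every function. The two helpers below are hypothetical and exist only to show the scoping. Both take one stack argument (a string pointer) and clobber only eax, ecx, and edx.

```asm
; Two hypothetical helpers; each has its own .next and .done.
count_bytes:                            ; returns the length in bytes
        mov     edx, [esp + 4]
        xor     eax, eax
  .next:                                ; full name: count_bytes.next
        cmp     byte [edx + eax], 0
        je      .done
        inc     eax
        jmp     .next
  .done:                                ; full name: count_bytes.done
        ret     4

count_ascii:                            ; returns the number of bytes below 0x80
        mov     edx, [esp + 4]
        xor     eax, eax
  .next:                                ; full name: count_ascii.next
        movzx   ecx, byte [edx]
        inc     edx
        test    ecx, ecx
        jz      .done
        cmp     ecx, 0x80
        jae     .next                   ; non-ASCII byte, do not count
        inc     eax
        jmp     .next
  .done:                                ; full name: count_ascii.done
        ret     4
```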
@@ -0,0 +24,4 @@
inc eax
jmp read_loop
done:
mov eax, ebx

Sometimes you put just one space between a mnemonic and a register, sometimes more. Please, read our code style here and follow it.
@@ -0,0 +35,4 @@
; ecx <- each byte
;
;=============================================================
count_utf8_gramphene:

It is a good practice for a library to have some prefix for its exported symbols. For example, libini names its functions ini.get_str, etc. I believe a unicode library can name count_utf8_gramphene as utf8.count_graphemes or something similar as you like it.
Internally, I am keeping the same name, but I am exporting it under such dotted names. Is this OK, or should I change it?
Some people like such prefixes, some don't. Not a requirement.
I will keep the function names as they are internally, but when exporting I will prefix each function name with utf.
@@ -0,0 +24,4 @@
push -1
push -1
push -1
call [con_init]

I believe you can use 'invoke' macro here and in similar cases.
Could you point me to the documentation of this invoke macro? That would be helpful. Is it part of KolibriOS or of FASM?
invoke is one of the standard fasm macros; we use it too. There is no documentation for those 8 lines, but there are examples. Look at e.g. the crashtest program:
This is just a shorter version of the following:
push bin
push 0
push read_data
push LIBCRASH_SHA2_256
call [crash.hash]
Well, maybe using macros is not very intuitive while you are only getting familiar with fasm. Feel free to use the long explicit form so far, no pressure at all.
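The one-line form the reviewer refers to is not reproduced in this thread. Based on the explicit sequence above, and on how the `invoke` macro in proc32.inc pushes its arguments in reverse order before an indirect call, it would presumably look like this (a reconstruction, not a quote from crashtest):

```asm
; Assumed equivalent of the five instructions above:
; arguments are listed left to right and pushed right to left.
        invoke  crash.hash, LIBCRASH_SHA2_256, read_data, 0, bin

; 'stdcall' is the sibling macro for a direct call to a label in the same
; program (call name), while 'invoke' calls through a pointer (call [name]),
; which is what imported library functions need.
```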
@@ -0,0 +4,4 @@
db 'MENUET01'
dd 0x01, START, I_END
dd 0x100000 ; 1MB Memory

Just guessing 1MiB will work. However, it is better to specify exact numbers. Check a few fasm programs to see how these two fields can be specified better.
I think the appended commit fixes this as well.
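One common way to avoid guessing the memory size is to let labels at the end of the file compute the image-size and memory fields. The sketch below shows that pattern for a MENUET01 header. The label names `STACKTOP` and `MEM` and the 4096-byte stack are illustrative assumptions, not values taken from the PR.

```asm
use32
org 0
        db      'MENUET01'              ; signature
        dd      1                       ; header version
        dd      START                   ; entry point
        dd      I_END                   ; size of the initialized image
        dd      MEM                     ; total memory the program needs
        dd      STACKTOP                ; initial esp
        dd      0                       ; command-line buffer (unused here)
        dd      0                       ; path buffer (unused here)

START:
        or      eax, -1                 ; system function -1: terminate
        int     0x40

I_END:                                  ; everything above is stored in the file
        rb      4096                    ; stack; the size is an assumption
align 16
STACKTOP:
MEM:                                    ; first byte the program does not need
```

Because `I_END` and `MEM` are resolved by the assembler, both fields stay exact when code or buffers are added later.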
@@ -0,0 +9,4 @@
dd 0x0
dd 0x0
include '../proc32.inc'

Please, use proc32.inc, macros.inc, and struct.inc from the /programs directory. In the same way, include dll.inc from there below.
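Assuming the example stays at /programs/develop/libraries/libunicode/examples/console.asm, as stated in the PR description, /programs is four directory levels up, so the shared includes would be referenced roughly like this:

```asm
; examples/ -> libunicode/ -> libraries/ -> develop/ -> programs/
include '../../../../proc32.inc'
include '../../../../macros.inc'
include '../../../../struct.inc'

; dll.inc contains code (dll.Load and friends), so it belongs in the code
; area, at the place where the example already includes its local copy:
include '../../../../dll.inc'
```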
@@ -0,0 +15,4 @@
START:
stdcall dll.Load, import_table
test eax, eax

Here and in all other places. Please, check and follow the code style. Originally, it was written for the kernel, but it is a good practice to follow it outside the kernel too. https://board.kolibrios.org/viewtopic.php?t=1950
I think this fixes it as well. Do you guys use a specific linter or FASM code formatter script to check these 8/16 column alignments locally, or do you just have 'kernel developer vision' from staring at the codebase all day? 😅
@@ -0,0 +76,4 @@
align 4
import_table:
library \
console, '/sys/lib/console.obj', \

'/sys/lib/' is the default path for libraries. You can write just 'console.obj' when it is in the default location.
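Putting the default-path remark together with the earlier prefix discussion, an import table for the example might look like the sketch below. It assumes the `library` and `import` macros from /programs/macros.inc. The console function names are real console.obj exports, while the libunicode export strings and the `utf8.` pointer names are hypothetical and must match whatever the library actually exports.

```asm
align 4
import_table:
library \
        console,    'console.obj', \
        libunicode, 'libunicode.obj'

import  console, \
        con_init,         'con_init', \
        con_write_asciiz, 'con_write_asciiz', \
        con_exit,         'con_exit'

import  libunicode, \
        utf8.count_codepoints, 'utf8_count_codepoints', \
        utf8.count_graphemes,  'utf8_count_graphemes'
```

After `stdcall dll.Load, import_table` succeeds, each name on the left holds a function pointer, so a call would be written as `invoke utf8.count_codepoints, some_string`.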
@@ -0,0 +1,615 @@
@^ fix macro comment {

This and other *.inc files should be removed from the PR, use these files from /programs instead.
fc40aa8f9b to 242469b1f3
242469b1f3 to e36f836a8b