QB64 Phoenix Edition
Testing for extended unicode? - Printable Version

+- QB64 Phoenix Edition (https://staging.qb64phoenix.com)
+-- Forum: QB64 Rising (https://staging.qb64phoenix.com/forumdisplay.php?fid=1)
+--- Forum: Code and Stuff (https://staging.qb64phoenix.com/forumdisplay.php?fid=3)
+---- Forum: Help Me! (https://staging.qb64phoenix.com/forumdisplay.php?fid=10)
+---- Thread: Testing for extended unicode? (/showthread.php?tid=1572)



Testing for extended unicode? - tothebin - 03-23-2023

I've done a lot of searching here and in the wiki and can't seem to figure this out. I am reading subtitle (.srt) files for videos and trying to determine programmatically which are in English. It's actually much easier to look for foreign characters and determine which files are NOT in English. The files are UTF-8. I understand the foreign characters use extended Unicode to display, and I think I understand how to use _MAPUNICODE to map ASCII characters to other Unicode characters. What I can't seem to find is how to work this in the other direction. If I copy some foreign language characters from a text file to the clipboard, say Greek letters, how do I read them from the clipboard and determine they are Greek and not standard ASCII characters? I noticed that when I read these characters from the clipboard using ASC(), a single Greek character shows up as several ASCII codes. But how do I know that isn't several 'standard' characters? Is it just a matter of using the right variable type?


RE: Testing for extended unicode? - RhoSigma - 03-23-2023

What you're trying to do is not possible in current QB64 versions. QB64 uses the CF_TEXT flag in its _CLIPBOARD$ commands, i.e. even if you copy Unicode from some other program into the clipboard, you won't get it back as is. The OS knows it currently has Unicode in the clip, but QB64 requests regular ANSI text (CF_TEXT instead of CF_UNICODETEXT), so the OS internally converts the clip contents to fit the ANSI type requested by QB64.

I had the same problem when doing some automation to maintain our Wiki via clipboard, but it constantly messed up any extended ASCII characters.

I'll open an issue on GitHub for it. For a quick fix you could go into the file "internal/c/libqb.cpp", look for "sub__clipboard" and "func__clipboard", and in those sub/functions replace all occurrences of CF_TEXT with CF_UNICODETEXT, then save the file.

After that, start the IDE and select the entry "Purge C++ libraries" from the "Debug" menu, then recompile your program and it should work with Unicode text.

BUT ATTENTION: Unicode on the clipboard usually means you are working with wide chars, not ASCII anymore, i.e. the encoding is now UTF-16 (little endian).
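To see the difference between the two encodings concretely, here is a small sketch (in Python, chosen only because it makes the byte values easy to inspect; the layouts themselves are language-neutral):

```python
# The Greek letter alpha (U+03B1) as it would appear in each clipboard format.
alpha = "\u03b1"

utf8_bytes = alpha.encode("utf-8")       # multi-byte UTF-8 sequence
utf16_bytes = alpha.encode("utf-16-le")  # one 16-bit code unit, low byte first

print(list(utf8_bytes))   # [206, 177]  -> lead byte 0xCE + continuation 0xB1
print(list(utf16_bytes))  # [177, 3]    -> the code unit 0x03B1, little endian
```

The same character, two completely different byte streams, which is why a program has to know which format the clipboard handed it.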

Maybe Zak (@Balderdash) can give you more info on that; it's probably easier/safer to use direct WinAPI calls than messing with QB64 internals.


RE: Testing for extended unicode? - SpriggsySpriggs - 03-23-2023

You can both read and display Unicode characters in QB64 by using the Win32 API. You'd need to use the "W" variant functions. I've used a lot of Unicode strings in QB64, Asian characters especially. That isn't to say I was handling keyboard input or anything like that, but I was both reading and displaying them (Unicode strings) properly. They're just a pain in the butt and can lead to crashes and memory leaks since QB64 isn't designed to handle them. You have to manage them with great care. I think I'd need you to write up a bit of pseudocode for your program's logic so I could better understand what it is you are needing on the Unicode side. Then I might be able to get you started.


RE: Testing for extended unicode? - tothebin - 03-23-2023

Thank you both for getting back to me so quickly. I should have been more specific in my first post. I'm running QB64pe on macOS High Sierra. Not sure where to find the QB64 build number, but the installation file is qb64pe_osx-3.5.0-18-6b9274e64.tar.gz. For my purposes I don't need to be able to read these characters, I don't speak the languages anyway. I don't need to display or print them either. I was just looking for a way to identify what kind of text, so I could determine if it is English, so I know whether to keep it in my video or remove it. It might be interesting to know what language is in the file, but not necessary. Sometimes I go at these things the hard way, it's in my nature to not only want to use things but to know how they work. I downloaded a series of subtitles in about 26 languages and started playing around with them and QB64pe. I was thinking that maybe if I could identify the code page, I could quickly determine the language of the subtitle.
Anyway, if I open the subtitle file (supposed to be UTF-8) with TextEdit (the Mac equivalent of Notepad), the foreign characters look like they should. If I copy them to the clipboard, then read the clipboard into my BASIC program, I can read the foreign characters as several ASCII codes. Of course, if printed by BASIC they no longer appear correct. I can manipulate the characters and send them back to the clipboard, then paste them into TextEdit, where they appear manipulated but in the appropriate language. Maybe I talk too much; here is some code:

text$ = _CLIPBOARD$
PRINT text$
z$ = text$ + " + " + text$
PRINT z$
_CLIPBOARD$ = z$
SLEEP
SYSTEM


Here is an image of the results (sorry, I don't know the best way to do this):  
  
[Image: Untitled.jpg]

The first box is the way it looks in TextEdit, you can see what I am about to copy is highlighted.
The second box is how it prints to the BASIC window (after modification).
The third image is how it looks after being copied to the clipboard by BASIC then pasted into TextEdit.

So after looking at a bunch of different language files, it seems most can be identified by looking at the ASCII codes. English files primarily use codes between 65 and 122. Foreign languages that don't use the English alphabet mostly use codes above 122. But there are still at least 3 languages that don't use English spellings but do use only the English alphabet. For those I suppose I'll have to look for occurrences of commonly used English words to weed them out. I was hoping to find a simpler test that would work on just a few characters, such as identifying that a single character uses multiple codes (the ♪ character reads as 226+153+170) or is from a different code page. Maybe there is a better way; I'm open to suggestions.
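The multi-byte behavior described above is easy to confirm, and the "high codes" heuristic can be sketched in a few lines (Python here, just for easy byte inspection; `mostly_ascii` and its threshold are made-up names, not anything from QB64):

```python
# '♪' (U+266A) really is stored as the three bytes seen via ASC().
note_bytes = "\u266a".encode("utf-8")
print(list(note_bytes))  # [226, 153, 170]

# A rough heuristic along the lines described above: treat a file as
# "probably English" when nearly all of its bytes are plain 7-bit ASCII.
def mostly_ascii(data: bytes, threshold: float = 0.95) -> bool:
    plain = sum(1 for b in data if b < 128)
    return plain >= threshold * max(len(data), 1)

print(mostly_ascii("Hello, world".encode("utf-8")))    # True
print(mostly_ascii("Καλημέρα κόσμε".encode("utf-8")))  # False
```

As noted in the post, a byte-frequency test like this cannot separate languages that share the English alphabet; those still need a word-level check.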

By the way, since switching to QB64pe from the old QB64 things are sooo much easier. _MESSAGEBOX, _SELECTFOLDERDIALOG$, and _OPENFILEDIALOG$ have been priceless. 

Big Grin


RE: Testing for extended unicode? - mnrvovrfc - 03-24-2023

You're dealing with Arabic, which brings another complication: such a language starts at the right margin, not the left like English. The QB64(PE) IDE was designed only for English left-to-right "printing".

It could be possible to use CHR$() if the Unicode byte combinations are known, create output files with them, and display them in a Unicode-capable program like LibreOffice Writer or Kate. But you will have to limit yourself to file input from and output to disk, because QB64 wasn't designed to handle Asian writing styles. _MAPUNICODE is a patchy solution that didn't completely solve problems with European languages, so it's not going to be better for Asian languages, each of which requires thousands of glyphs; a single font would have to accommodate all of them, which bloats its size compared to the old Microsoft FON files.

You've demonstrated you know how to come up with some of those glyphs. You'll have to stick to that program you're using and come up with the Unicode byte combinations, one byte at a time, because CHR$() can't handle values larger than 255. What Balderdash proposes is reliable but a pain to program for someone who is comfortable with BASIC.


RE: Testing for extended unicode? - tothebin - 03-24-2023

(03-24-2023, 01:13 AM)mnrvovrfc Wrote: You're dealing with Arabic, which brings another complication: such a language starts at the right margin, not the left like English. The QB64(PE) IDE was designed only for English left-to-right "printing".

It could be possible to use CHR$() if the Unicode byte combinations are known, create output files with them, and display them in a Unicode-capable program like LibreOffice Writer or Kate. But you will have to limit yourself to file input from and output to disk, because QB64 wasn't designed to handle Asian writing styles. _MAPUNICODE is a patchy solution that didn't completely solve problems with European languages, so it's not going to be better for Asian languages, each of which requires thousands of glyphs; a single font would have to accommodate all of them, which bloats its size compared to the old Microsoft FON files.

You've demonstrated you know how to come up with some of those glyphs. You'll have to stick to that program you're using and come up with the Unicode byte combinations, one byte at a time, because CHR$() can't handle values larger than 255. What Balderdash proposes is reliable but a pain to program for someone who is comfortable with BASIC.

I noticed the right-to-left nature in TextEdit, that was interesting. The goal here wasn't to read or print any of these characters, but simply to know that they aren't English. In rebuilding my video database I am extracting all subtitles from each of thousands of videos, then programmatically eliminating the foreign language ones, and identifying the English ones as Normal, Forced, or Hearing Impaired. Doing that manually would take forever (or at least it would seem like forever). After the subtitles have been 'filtered', they will be re-embedded into their respective videos. Part of that process also includes converting .ass subtitles to .srt, and 'cleaning' the subtitle files to remove improper formatting and such. No sense in converting and cleaning a subtitle file only to erase it later when I see it is foreign language.

I've noticed that foreign characters are returned by ASC() as multiple byte codes. But when parsing the clipboard I don't see a way of determining whether several ASCII codes represent the combination used to make a foreign character, or are just several English characters. Unless there is a way to do that, it seems the best route for me so far is to eliminate files containing large numbers of characters with high ASCII codes as being foreign, then use other methods on the remaining files to determine their nature.

You said "CHR$() can't handle values larger than 255". Using ASC() to parse the clipboard, "¿" reads as 194+191, as would two of the graphic characters from the normal font (sorry, I don't know how to paste them here). I wish I could tell the difference.
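The difference is in fact detectable from the bit patterns of the two bytes; a quick sketch (Python, used here only for easy byte inspection):

```python
# '¿' (U+00BF) encodes in UTF-8 as exactly the two bytes 194, 191.
lead, cont = "\u00bf".encode("utf-8")
print(lead, cont)  # 194 191

# Those bytes are distinguishable from two separate plain characters by
# their high-bit patterns; 7-bit ASCII bytes are always below 128.
print((lead & 0xE0) == 0xC0)  # True: 110xxxxx marks a 2-byte lead byte
print((cont & 0xC0) == 0x80)  # True: 10xxxxxx marks a continuation byte
```

So two genuine ASCII characters can never be mistaken for a UTF-8 pair: both bytes of the pair have their high bit set, while plain ASCII never does.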


RE: Testing for extended unicode? - RhoSigma - 03-24-2023

If your goal is just to know whether it's English, or better, regular 7-bit ASCII (0-127), then you just need to identify the UTF-8 markers. Everything that is not part of a UTF-8 sequence is automatically pure ASCII.

I use such a check in the code which renders the Wiki help text in the IDE, it's basically as follows:
Code: (Select All)
'UTF-8 handling
text$ = "whatever you get from your input"
FOR currPos% = 1 TO LEN(text$)
    seq$ = MID$(text$, currPos%, 4) '    get next 4 chars (becomes less than 4 at the end of text$)
    seq$ = seq$ + SPACE$(4 - LEN(seq$)) 'fill missing chars with space (safety for ASC())
    IF ((ASC(seq$, 1) AND &HE0~%%) = 192) AND ((ASC(seq$, 2) AND &HC0~%%) = 128) THEN
        '2-byte UTF-8
        currPos% = currPos% + 1 'skip the continuation byte
    ELSEIF ((ASC(seq$, 1) AND &HF0~%%) = 224) AND ((ASC(seq$, 2) AND &HC0~%%) = 128) AND ((ASC(seq$, 3) AND &HC0~%%) = 128) THEN
        '3-byte UTF-8
        currPos% = currPos% + 2 'skip the continuation bytes
    ELSEIF ((ASC(seq$, 1) AND &HF8~%%) = 240) AND ((ASC(seq$, 2) AND &HC0~%%) = 128) AND ((ASC(seq$, 3) AND &HC0~%%) = 128) AND ((ASC(seq$, 4) AND &HC0~%%) = 128) THEN
        '4-byte UTF-8
        currPos% = currPos% + 3 'skip the continuation bytes
    ELSE
        '1st char of seq$ = regular ASCII
    END IF
NEXT
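For anyone who wants to experiment with the same lead-byte tests outside QB64, here is the identical logic sketched in Python (the function names are my own, not part of any library):

```python
# Length of the UTF-8 sequence that starts with byte b0 (1 = plain ASCII).
def utf8_seq_len(b0: int) -> int:
    if b0 & 0xE0 == 0xC0: return 2  # 110xxxxx -> 2-byte sequence
    if b0 & 0xF0 == 0xE0: return 3  # 1110xxxx -> 3-byte sequence
    if b0 & 0xF8 == 0xF0: return 4  # 11110xxx -> 4-byte sequence
    return 1

# True if the data contains any multi-byte UTF-8 sequence at all.
def has_multibyte(data: bytes) -> bool:
    i = 0
    while i < len(data):
        n = utf8_seq_len(data[i])
        if n > 1:
            return True
        i += n  # step over the whole sequence, continuation bytes included
    return False

print(has_multibyte("plain English text".encode("utf-8")))  # False
print(has_multibyte("naïve".encode("utf-8")))               # True
```

The masks are the same as in the BASIC version: &HE0/&HF0/&HF8 isolate the marker bits of the lead byte, and &HC0 checks each continuation byte for the 10xxxxxx pattern.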



RE: Testing for extended unicode? - tothebin - 03-24-2023

(03-24-2023, 04:43 PM)RhoSigma Wrote: If your goal is just to know whether it's English, or better, regular 7-bit ASCII (0-127), then you just need to identify the UTF-8 markers. Everything that is not part of a UTF-8 sequence is automatically pure ASCII.

I use such a check in the code which renders the Wiki help text in the IDE, it's basically as follows:
Code: (Select All)
'UTF-8 handling
text$ = "whatever you get from your input"
FOR currPos% = 1 TO LEN(text$)
    seq$ = MID$(text$, currPos%, 4) '    get next 4 chars (becomes less than 4 at the end of text$)
    seq$ = seq$ + SPACE$(4 - LEN(seq$)) 'fill missing chars with space (safety for ASC())
    IF ((ASC(seq$, 1) AND &HE0~%%) = 192) AND ((ASC(seq$, 2) AND &HC0~%%) = 128) THEN
        '2-byte UTF-8
        currPos% = currPos% + 1 'skip the continuation byte
    ELSEIF ((ASC(seq$, 1) AND &HF0~%%) = 224) AND ((ASC(seq$, 2) AND &HC0~%%) = 128) AND ((ASC(seq$, 3) AND &HC0~%%) = 128) THEN
        '3-byte UTF-8
        currPos% = currPos% + 2 'skip the continuation bytes
    ELSEIF ((ASC(seq$, 1) AND &HF8~%%) = 240) AND ((ASC(seq$, 2) AND &HC0~%%) = 128) AND ((ASC(seq$, 3) AND &HC0~%%) = 128) AND ((ASC(seq$, 4) AND &HC0~%%) = 128) THEN
        '4-byte UTF-8
        currPos% = currPos% + 3 'skip the continuation bytes
    ELSE
        '1st char of seq$ = regular ASCII
    END IF
NEXT

PERFECT!!!!!!!!!!!!!!!! Thank you very much!!! That was exactly what I was looking for. I knew there had to be a way. I don't fully understand the coding example yet; I've never needed to use hex before, or multi-byte text. But I understand them in principle, and kind of understand what you did. I'm sure with a little reading I will understand it completely. This will certainly get me there. Thank you again; I can't wait to dig into this and get it working.

Maybe one of these days I'll dig into libraries...


RE: Testing for extended unicode? - mnrvovrfc - 03-24-2023

This is something like MIDI "delta time".

As I've said, CHR$() can only take an unsigned byte as a parameter, which cannot go higher than 255. But using that to reconstruct a Unicode character assumes UTF-8 somewhere, doesn't it? I've learned something here. It might not give me everything I want, such as getting a Unicode-ready text editor or word processor to display whatever character I want that I can see with a given font loaded into Gucharmap. At least then I could take a screenshot and capture those characters as BMP or something, to be able to use them in a QB64 program.
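Right, the bytes only become a visible character once something decodes them as UTF-8. A small sketch of that round trip (Python, illustrative only):

```python
# The QB64 equivalent would be CHR$(226) + CHR$(153) + CHR$(170):
raw = bytes([226, 153, 170])
note = raw.decode("utf-8")  # only meaningful if the consumer assumes UTF-8
print(note)       # ♪
print(len(note))  # 1: three bytes collapse into a single character
```

Handed the same three bytes, a viewer that assumes a single-byte code page would instead show three unrelated glyphs, which is exactly what PRINT does in QB64.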


RE: Testing for extended unicode? - tothebin - 03-24-2023

(03-24-2023, 10:26 PM)mnrvovrfc Wrote: This is something like MIDI "delta time".

As I've said, CHR$() can only take an unsigned byte as a parameter, which cannot go higher than 255. But using that to reconstruct a Unicode character assumes UTF-8 somewhere, doesn't it? I've learned something here. It might not give me everything I want, such as getting a Unicode-ready text editor or word processor to display whatever character I want that I can see with a given font loaded into Gucharmap. At least then I could take a screenshot and capture those characters as BMP or something, to be able to use them in a QB64 program.

That's why I always like to know how something works, not just how to use it. True knowledge is never useless, only the application of it might be. When I moved into my new house, my 95lb wife and I moved my 3500lb milling machine from the driveway, across the garage, up a step and through a doorway into the new shop, using just chunks of wood and a hand-truck. When I saw the movers trying to man-handle the clothes washer I gave them the hand-truck. No kidding, I put the hand-truck under the washer and watched two grown men pick up the hand-truck/washer and carry them into the house. I hope I never stop learning.