Testing for extended Unicode?
(03-24-2023, 01:13 AM)mnrvovrfc Wrote: You're writing in Arabic, which brings another complication: such a language starts at the right margin, not the left like English. The QB64(PE) IDE was designed only for English left-to-right "printing".

It could be possible to use CHR$() if the Unicode byte combinations are known, create output files with them, and display them in a Unicode-capable program like LibreOffice Writer or Kate. But you will have to limit yourself to file input from and output to disk, because QB64 wasn't designed to handle Asian writing systems. _MAPUNICODE is a patchy solution which didn't completely solve problems with European languages, so it's not going to be better with Asian languages, each of which requires thousands of glyphs; a single font would have to accommodate all of them, which bloats its size compared to the old Microsoft FON files.
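Here's a minimal sketch of that byte-at-a-time idea, assuming a UTF-8 output file (the file name and the sample code points are my own picks, not anything from this thread). QB64 itself can't display these glyphs, but a Unicode-capable editor will:

Code:
OPEN "unicode_out.txt" FOR OUTPUT AS #1
PRINT #1, CHR$(239); CHR$(187); CHR$(191); ' UTF-8 BOM so editors detect the encoding
PRINT #1, CHR$(194); CHR$(191) ' U+00BF inverted question mark, UTF-8 bytes 194,191
PRINT #1, CHR$(216); CHR$(185) ' U+0639 Arabic letter AIN, UTF-8 bytes 216,185
CLOSE #1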

You've demonstrated you know how to come up with some of those glyphs. You'll have to stick to that program you're using and come up with the Unicode combinations one byte at a time, because CHR$() can't handle values larger than 255. What Balderdash proposes is reliable, but a pain to program for someone who is comfortable with BASIC.

I noticed the right-to-left nature in TextEdit; that was interesting. The goal here wasn't to read or print any of these characters, but simply to know that they aren't English. In rebuilding my video database I am extracting all subtitles from each of thousands of videos, then programmatically eliminating the foreign-language ones and identifying the English ones as Normal, Forced, or Hearing Impaired. Doing that manually would take forever (or at least it would seem like forever). After the subtitles have been 'filtered', they will be re-embedded into their respective videos. Part of that process also includes converting .ass subtitles to .srt and 'cleaning' the subtitle files to remove improper formatting and such. There's no sense in converting and cleaning a subtitle file only to erase it later when I see it is in a foreign language.

I've noticed that foreign characters come through as multi-byte sequences, which ASC() can only read one byte at a time. But when parsing the clipboard I don't see a way of determining whether several of those byte values represent the combination used to make a foreign character, or are just several English characters. Unless there is a way to do that, it seems the best route for me so far is to eliminate files containing large numbers of characters with high ASCII codes as foreign, then use other methods on the remaining files to determine their nature.
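There may be a way to make that determination, assuming the clipboard hands back UTF-8 (the function name below is my own): a genuine multi-byte character always starts with a lead byte in the 194-244 range followed by the right number of continuation bytes in the 128-191 range, a pattern that a run of ordinary high-ASCII characters will rarely satisfy. A sketch that counts valid sequences, using QB64's two-argument ASC(string$, position) to read one byte at a time:

Code:
FUNCTION CountUTF8& (s AS STRING)
    DIM i AS LONG, b AS LONG, extra AS LONG, j AS LONG, valid AS LONG, total AS LONG
    i = 1
    DO WHILE i <= LEN(s)
        b = ASC(s, i) ' byte value at position i
        IF b >= 194 AND b <= 223 THEN ' lead byte of a 2-byte sequence
            extra = 1
        ELSEIF b >= 224 AND b <= 239 THEN ' lead byte of a 3-byte sequence
            extra = 2
        ELSEIF b >= 240 AND b <= 244 THEN ' lead byte of a 4-byte sequence
            extra = 3
        ELSE
            extra = 0 ' plain ASCII, stray continuation byte, or invalid lead
        END IF
        valid = 0
        IF extra > 0 AND i + extra <= LEN(s) THEN
            valid = -1
            FOR j = 1 TO extra ' continuation bytes must all be 128-191
                IF ASC(s, i + j) < 128 OR ASC(s, i + j) > 191 THEN valid = 0
            NEXT
        END IF
        IF valid THEN total = total + 1: i = i + extra + 1 ELSE i = i + 1
    LOOP
    CountUTF8& = total
END FUNCTION

A file where this count is large relative to the file's length is almost certainly not plain English, which is the same filtering you describe, just done per sequence instead of per byte.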

You said "CHR$() can't handle values larger than 255". Using ASC() to parse the clipboard, "¿" reads as 194+191, and so do two of the graphics characters from the normal font (sorry, I don't know how to paste them here). I wish I could tell the difference.
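For what it's worth, 194+191 happens to be the UTF-8 encoding of U+00BF, which suggests the clipboard in your setup is handing everything back as UTF-8, CP437 graphics characters included. A hedged sketch of decoding a two-byte pair back to its code point and checking which Unicode block it falls in (the variable names are mine):

Code:
DIM lead AS LONG, nxt AS LONG, cp AS LONG
lead = 194: nxt = 191 ' the pair that "¿" produced
IF lead >= 194 AND lead <= 223 AND nxt >= 128 AND nxt <= 191 THEN
    cp = (lead AND 31) * 64 + (nxt AND 63) ' here (194 AND 31) * 64 + (191 AND 63) = 191
    PRINT "Valid UTF-8 pair, code point U+"; HEX$(cp) ' U+BF, the inverted question mark
    IF cp >= &H600 AND cp <= &H6FF THEN PRINT "That one is in the Arabic block"
ELSE
    PRINT "Not a UTF-8 pair; likely two raw CP437 bytes"
END IF

Once decoded, the code point's range tells you the script: Latin-1 Supplement pairs like this one can appear in near-English text, while anything in the Arabic or CJK blocks is a strong sign the file is foreign.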