Testing for extended unicode?
#4
Thank you both for getting back to me so quickly. I should have been more specific in my first post. I'm running QB64pe on macOS High Sierra. I'm not sure where to find the QB64 build number, but the installation file is qb64pe_osx-3.5.0-18-6b9274e64.tar.gz. For my purposes I don't need to be able to read these characters; I don't speak the languages anyway. I don't need to display or print them either. I was just looking for a way to identify what kind of text it is, so I can tell whether it is English and decide whether to keep it in my video or remove it. It might be interesting to know what language is in the file, but it's not necessary. Sometimes I go at these things the hard way; it's in my nature to want not only to use things but to know how they work. I downloaded a set of subtitles in about 26 languages and started playing around with them in QB64pe. I was thinking that if I could identify the code page, I could quickly determine the language of the subtitle.
Anyway, if I open the subtitle file (supposed to be UTF-8) with TextEdit (the Mac equivalent of Notepad), the foreign characters look like they should. If I copy them to the clipboard and then read the clipboard into my BASIC program, each foreign character comes through as several ASCII codes. Of course, if printed by BASIC they no longer appear correct. I can manipulate the characters, send them back to the clipboard, and then paste them into TextEdit, where they appear manipulated but in the appropriate language. Maybe I talk too much, so here is some code:

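text$ = _CLIPBOARD$ ' read the copied subtitle text from the clipboard
PRINT text$ ' the multi-byte characters look wrong in the BASIC window
z$ = text$ + " + " + text$ ' simple manipulation: the text twice, joined by " + "
PRINT z$
_CLIPBOARD$ = z$ ' send the result back to the clipboard for pasting into TextEdit
SLEEP ' wait for a key press
SYSTEM ' exit the program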


Here is an image of the results (sorry, I don't know the best way to do this):  
  
[Image: Untitled.jpg]

The first box is the way it looks in TextEdit; you can see that what I am about to copy is highlighted.
The second box is how it prints to the BASIC window (after modification).
The third box is how it looks after being copied to the clipboard by BASIC and then pasted into TextEdit.
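
For anyone who wants to see the individual codes behind the characters that look wrong in the second box, here is a minimal sketch. It assumes the clipboard holds the UTF-8 text copied from TextEdit, and it relies on QB64's ASC function accepting an optional position argument. It is only an illustration, not part of my actual program.

```basic
' Sketch: print the numeric code of every byte in the clipboard text.
' Assumes the clipboard holds text copied from a UTF-8 subtitle file.
text$ = _CLIPBOARD$
FOR i = 1 TO LEN(text$)
    PRINT ASC(text$, i); ' code of the byte at position i
NEXT i
PRINT
SLEEP
SYSTEM
```

A plain English letter should show up as a single code below 128, while each foreign character should show up as a run of two or more codes of 128 or higher.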

So after looking at a bunch of different language files, it seems most can be identified by looking at the ASCII codes. English files primarily use ASCII codes between 65 and 122. Foreign languages that don't use the English alphabet use mostly codes above 122. But there are still at least 3 languages that don't use English spellings but do use only the English alphabet. For those I suppose I'll have to look for occurrences of commonly used English words to weed them out. I was hoping to find a simpler test that would work on just a few characters, such as identifying that a single character uses multiple codes (the ♪ character shows up as the three codes 226, 153, and 170) or is from a different code page. Maybe there is a better way; I'm open to suggestions.
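
The "single character uses multiple codes" test can be sketched from the way UTF-8 is laid out: the first byte of each character tells you how many bytes the character occupies. The sketch below is only an assumption about how such a test might look. Utf8SeqLen% is a hypothetical helper name, not a QB64 built-in, and the sketch does not check that the following bytes are valid continuation bytes.

```basic
' Sketch: count single-byte (ASCII) and multi-byte UTF-8 characters
' in the clipboard text. Assumes the text is UTF-8.
text$ = _CLIPBOARD$
asciiCount& = 0
multiCount& = 0
i& = 1
DO WHILE i& <= LEN(text$)
    n% = Utf8SeqLen%(ASC(text$, i&))
    IF n% = 1 THEN
        asciiCount& = asciiCount& + 1
    ELSEIF n% > 1 THEN
        multiCount& = multiCount& + 1
    END IF
    IF n% < 1 THEN n% = 1 ' stray or invalid byte: step past it
    i& = i& + n%
LOOP
PRINT "Single-byte (ASCII) characters:"; asciiCount&
PRINT "Multi-byte characters:"; multiCount&
SLEEP
SYSTEM

' Hypothetical helper: length in bytes of the UTF-8 sequence whose
' first byte is b. Returns 0 for a continuation byte (128 to 191)
' or a byte that cannot start a valid sequence.
FUNCTION Utf8SeqLen% (b AS INTEGER)
    SELECT CASE b
        CASE 0 TO 127
            Utf8SeqLen% = 1
        CASE 194 TO 223
            Utf8SeqLen% = 2
        CASE 224 TO 239
            Utf8SeqLen% = 3
        CASE 240 TO 244
            Utf8SeqLen% = 4
        CASE ELSE
            Utf8SeqLen% = 0
    END SELECT
END FUNCTION
```

With this layout the ♪ bytes 226, 153, 170 are counted as one three-byte character, because 226 falls in the 224 to 239 range. An English subtitle can still contain a few multi-byte characters such as ♪, so comparing the two counts is probably more useful than treating any single multi-byte character as proof of a foreign language. It also won't help with the languages that use only the English alphabet.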

By the way, since switching to QB64pe from the old QB64, things are sooo much easier. _MESSAGEBOX, _SELECTFOLDERDIALOG$, and _OPENFILEDIALOG$ have been priceless.



Messages In This Thread
Testing for extended unicode? - by tothebin - 03-23-2023, 06:01 AM
RE: Testing for extended unicode? - by RhoSigma - 03-23-2023, 08:29 AM
RE: Testing for extended unicode? - by tothebin - 03-23-2023, 09:48 PM
RE: Testing for extended unicode? - by mnrvovrfc - 03-24-2023, 01:13 AM
RE: Testing for extended unicode? - by tothebin - 03-24-2023, 02:15 PM
RE: Testing for extended unicode? - by RhoSigma - 03-24-2023, 04:43 PM
RE: Testing for extended unicode? - by tothebin - 03-24-2023, 10:01 PM
RE: Testing for extended unicode? - by mnrvovrfc - 03-24-2023, 10:26 PM
RE: Testing for extended unicode? - by tothebin - 03-24-2023, 10:43 PM


