MemFile System
#11
I've got much better word lists and dictionaries, if you need something like that. The reason I tend to use the one I chose here is simply its sheer size and number of entries. It makes a good baseline for timed tests to see how long it takes to load and process something. For just pure *words*, I'd suggest downloading and using the Official Scrabble Dictionary.txt. ;)
#12
(08-27-2022, 06:37 AM)SMcNeill Wrote: I've got much better word lists and dictionaries, if you need something like that. The reason I tend to use the one I chose here is simply its sheer size and number of entries. It makes a good baseline for timed tests to see how long it takes to load and process something. For just pure *words*, I'd suggest downloading and using the Official Scrabble Dictionary.txt. ;)

I have, and I am. This one just caught my attention, as the Scrabble one is about 280,000 words. I sub-divided the Scrabble one into 26 files so I can load just the appropriate file when checking words, which saves search time. :)
Of all the places on Earth, and all the planets in the Universe, I'd rather live here (Perth, W.A.) Big Grin
#13
Something I just noticed with this is that if you are assuming that the line endings are CHR$(13) + CHR$(10) ("\r\n") then that might not work with a file that has UNIX line endings, which I think is just CHR$(10) ("\n"). You might want to split on just CHR$(10) and then check for CHR$(13) existing after the split. If it does, you can just delete those. A foolproof way that I split a file up is by using my tokenize function, which uses strtok. It takes a list of characters to split on and it works just fine regardless of the file having UNIX or Windows line endings.
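The "split on CHR$(10), then drop any leftover CHR$(13)" idea can be sketched in plain QB64 along these lines. This is only an illustration, not the tokenize function mentioned above and not part of the MemFile code; the name SplitLines& and the sample text are made up for the example.

```basic
' Minimal sketch: split on Chr$(10), then strip a trailing Chr$(13) from each piece.
' SplitLines& and sample$ are hypothetical names used only for this illustration.
ReDim lines$(0) ' dynamic array so the function can resize it
sample$ = "alpha" + Chr$(13) + Chr$(10) + "beta" + Chr$(10) + "gamma"
count& = SplitLines&(sample$, lines$())
For i& = 1 To count&
    Print i&; ": ["; lines$(i&); "]"
Next

Function SplitLines& (text$, result$())
    ReDim result$(0)
    n& = 0
    start& = 1
    Do While start& <= Len(text$)
        p& = InStr(start&, text$, Chr$(10))
        If p& = 0 Then p& = Len(text$) + 1 ' last line has no terminator
        piece$ = Mid$(text$, start&, p& - start&)
        ' A Windows (CRLF) line still carries its Chr$(13) after the split
        If Right$(piece$, 1) = Chr$(13) Then piece$ = Left$(piece$, Len(piece$) - 1)
        n& = n& + 1
        ReDim _Preserve result$(n&)
        result$(n&) = piece$
        start& = p& + 1
    Loop
    SplitLines& = n& ' number of lines stored in result$(1..n&)
End Function
```

Growing the array one element at a time with ReDim _Preserve keeps the sketch short. For a very large file you would probably want to grow it in bigger chunks instead.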
Ask me about Windows API and maybe some Linux stuff
#14
(08-31-2022, 02:14 PM)Spriggsy Wrote: Something I just noticed with this is that if you are assuming that the line endings are CHR$(13) + CHR$(10) ("\r\n") then that might not work with a file that has UNIX line endings, which I think is just CHR$(10) ("\n"). You might want to split on just CHR$(10) and then check for CHR$(13) existing after the split. If it does, you can just delete those. A foolproof way that I split a file up is by using my tokenize function, which uses strtok. It takes a list of characters to split on and it works just fine regardless of the file having UNIX or Windows line endings.


Code:
    'we want to auto-detect our CRLF endings
    'as we have the file in temp$ at the moment, we'll just search for it via instr
    If InStr(temp$, Chr$(13) + Chr$(10)) Then
        MemFile(i).CRLF = Chr$(13) + Chr$(10)
    ElseIf InStr(temp$, Chr$(10)) Then
        MemFile(i).CRLF = Chr$(10)
    ElseIf InStr(temp$, Chr$(13)) Then
        MemFile(i).CRLF = Chr$(13)
    Else
        Error 5: Exit Function
    End If


It searches your file to see what type of line endings it has. Unless you have mixed endings (for example, some lines end with CHR$(10) and others with CHR$(13)), it'll work automagically for you. If you do have mixed endings, you'll probably need to write a routine to normalize them to one format or the other before making use of these functions. I didn't want to slow down the INPUT calls by having them run a series of IF checks on every line to see whether it ends in 10, 13, or 13+10. I was going more for speed and efficiency, which should work for 99.9% of files, than for the flexibility to read every mixed-ending file out there. ;)
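For anyone who does run into mixed endings, a normalizing pass might look roughly like the sketch below. It is not part of the MemFile library. NormalizeEndings$ is a hypothetical helper that assumes the whole file is already in a string, and it rewrites CR+LF and lone CR as a single CHR$(10) so the auto-detection above would then see only one kind of ending.

```basic
' Hypothetical helper, not part of MemFile: turn CR+LF and lone CR into Chr$(10).
mixed$ = "one" + Chr$(13) + Chr$(10) + "two" + Chr$(13) + "three" + Chr$(10) + "four"
clean$ = NormalizeEndings$(mixed$)
Print Len(mixed$); "bytes before,"; Len(clean$); "bytes after"

Function NormalizeEndings$ (text$)
    result$ = Space$(Len(text$)) ' output is never longer than the input
    j& = 0
    i& = 1
    Do While i& <= Len(text$)
        c% = Asc(text$, i&)
        If c% = 13 Then
            ' Skip the Chr$(10) of a CR+LF pair so the pair collapses to one byte
            If i& < Len(text$) Then
                If Asc(text$, i& + 1) = 10 Then i& = i& + 1
            End If
            c% = 10
        End If
        j& = j& + 1
        Asc(result$, j&) = c% ' write the byte in place instead of concatenating
        i& = i& + 1
    Loop
    NormalizeEndings$ = Left$(result$, j&)
End Function
```

Writing into a preallocated string avoids repeated concatenation, so the one-time cost is paid at load rather than on every line read.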
#15
(08-31-2022, 02:14 PM)Spriggsy Wrote: Something I just noticed with this is that if you are assuming that the line endings are CHR$(13) + CHR$(10) ("\r\n") then that might not work with a file that has UNIX line endings, which I think is just CHR$(10) ("\n"). You might want to split on just CHR$(10) and then check for CHR$(13) existing after the split. If it does, you can just delete those. A foolproof way that I split a file up is by using my tokenize function, which uses strtok. It takes a list of characters to split on and it works just fine regardless of the file having UNIX or Windows line endings.

If you are in need of a more comprehensive system, then take this one:
https://staging.qb64phoenix.com/showthread.php?tid=486

It's basically the same thing, but built on a string array rather than _MEM.



