Wiki TEXT pages downloader
Sometimes, it's easiest to just take an idea and toss it out the door and start over completely from scratch -- and that's what I've decided to do here!

The code I shared originally was basically ripped directly from the QB64 source, stitched together, and then operated on and altered like Frankenstein's monster until it could cough and sputter and produce a semi-reasonable result...

But it's long.  And messy.  And almost impossible to follow and sort out what's doing what, where it's doing it, and why it's doing it...

So, I've decided to back up and reboot my approach to handling this type of issue.  What I have now is this much simpler code:

Code:
$Console:Only
Const HomePage$ = "https://qb64phoenix.com"

NumberOfPages = DownloadPageLists


Function DownloadPageLists
    FileLeft$ = "Page List("
    FileRight$ = ").txt"
    FileCount = 1
    CurrentFile$ = ""
    url$ = "/qb64wiki/index.php/Special:AllPages" 'the first file that we download
    Do
        file$ = FileLeft$ + _Trim$(Str$(FileCount)) + FileRight$
        Download url$, file$
        url2$ = GetNextPage$(file$)
        p = InStr(url2$, "from=")
        If p = 0 Then Exit Do
        If Mid$(url2$, p + 5) > CurrentFile$ Then
            CurrentFile$ = Mid$(url2$, p + 5)
            FileCount = FileCount + 1
            url$ = url2$
        Else
            Exit Do
        End If
    Loop
    DownloadPageLists = FileCount
End Function

Function CleanHTML$ (OriginalText$)
    text$ = OriginalText$ 'don't corrupt incoming text
    Type ReplaceList
        original As String
        replacement As String
    End Type

    'Expandable HTML replacement system
    Dim HTML(1) As ReplaceList
    HTML(0).original = "&amp;": HTML(0).replacement = "&"
    HTML(1).original = "%24": HTML(1).replacement = "$"

    For i = 0 To UBound(HTML)
        Do
            p = InStr(text$, HTML(i).original)
            If p = 0 Then Exit Do
            text$ = Left$(text$, p - 1) + HTML(i).replacement + Mid$(text$, p + Len(HTML(i).original))
        Loop
    Next
    CleanHTML$ = text$
End Function

Sub Download (url$, outputFile$)
    url2$ = CleanHTML$(url$) 'decode any HTML-escaped characters in the link
    'Print "https://qb64phoenix.com/qb64wiki/index.php?title=Special:AllPages&from=KEY+n"
    'Print HomePage$ + url2$
    Shell "curl -o " + Chr$(34) + outputFile$ + Chr$(34) + " " + Chr$(34) + HomePage$ + url2$ + Chr$(34)
End Sub

Function GetNextPage$ (currentPage$)
    SpecialPageDivClass$ = "<div class=" + Chr$(34) + "mw-allpages-nav" + Chr$(34) + ">"
    SpecialPageLink$ = "<a href="
    SpecialPageEndLink$ = Chr$(34) + " title"
    Open currentPage$ For Binary As #1
    l = LOF(1)
    t$ = Space$(l)
    Get #1, 1, t$
    Close
    sp = InStr(t$, SpecialPageDivClass$)
    If sp Then
        lp = InStr(sp, t$, SpecialPageLink$)
        If lp Then
            lp = lp + 9 'skip past the <a href= text plus the opening quote to reach the start of the link
            lp2 = InStr(lp, t$, SpecialPageEndLink$)
            link$ = Mid$(t$, lp, lp2 - lp)
            GetNextPage$ = CleanHTML$(link$)
        End If
    End If
End Function


Only about 80 lines, but this already connects to the wiki and downloads two pages of vitally important data for us -- the lists of all the pages inside our wiki!

Just by parsing these, I should now be able to build a simple list of all the page names as they exist in our wiki, then grab and download them one after another and save them wherever I want.
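
Something along these lines is what I have in mind for that parsing step.  It's just a rough, untested sketch at this point: the DownloadAllPages name is made up, it assumes every page in the list files shows up as a standard MediaWiki link that starts with <a href="/qb64wiki/index.php/PageName, and it leans on the Download sub and CleanHTML$ function from the listing above.

Code:
'Hypothetical sketch only, not part of the program above (yet).
Sub DownloadAllPages (NumberOfPages)
    LinkMarker$ = "<a href=" + Chr$(34) + "/qb64wiki/index.php/"
    For i = 1 To NumberOfPages
        listFile$ = "Page List(" + _Trim$(Str$(i)) + ").txt" 'the files DownloadPageLists saved
        Open listFile$ For Binary As #1
        t$ = Space$(LOF(1))
        Get #1, 1, t$
        Close #1
        p = InStr(t$, LinkMarker$)
        Do While p
            linkStart = p + 9 'skip past <a href=" so we keep the /qb64wiki/... portion
            linkEnd = InStr(linkStart, t$, Chr$(34)) 'closing quote of the href
            link$ = Mid$(t$, linkStart, linkEnd - linkStart)
            pageName$ = Mid$(link$, Len("/qb64wiki/index.php/") + 1)
            'Note: page names aren't sanitized here, so characters that aren't
            'valid in filenames would still need handling.
            Download link$, CleanHTML$(pageName$) + ".txt" 'one text file per wiki page
            p = InStr(linkEnd, t$, LinkMarker$)
        Loop
    Next
End Sub

If that holds up, the whole main program would just be the DownloadPageLists call followed by something like DownloadAllPages NumberOfPages.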

I don't have a whole wiki downloader yet, but I've got the wiki page-list downloader now in less than 80 lines of code.  It shouldn't be very hard to go from this to the finished form, and the whole program should come in at less than a few hundred lines in total.