Validate External Links: Difference between revisions

From OniGalore
(this feature already exists, I just haven't been using it)
m (link fix)
 
(2 intermediate revisions by the same user not shown)
Line 2: Line 2:


==Background==
==Background==
While MediaWiki makes it easy to find bad intrawiki links (links to nonexistent pages on our own wiki), marking them in red and providing tools like [[Special:Wantedpages]], there is no automatic check of external (outbound) links. MediaWiki compiles external links into a table, but it does not ping the URLs to see if they give any response. Over the years, many links on our wiki went dead as the Web changed and various file hosts went out of business. ValExtLinks has been used to fix over 1,000 link issues on OniGalore such as 404s and redirects.
While MediaWiki makes it easy to find bad intrawiki links (links to nonexistent pages on our own wiki) by marking them in red and providing tools like [[Special:Wantedpages]], there is no automatic check of external (outbound) links. MediaWiki compiles external links into a table, but it does not ping the URLs to see if they give any response. Over the years, many links on our wiki went dead as the Web changed and various file hosts went out of business. ValExtLinks has been used to fix thousands of link issues on OniGalore such as 404s and redirects.


Here's how the process works: at 6:20am and 2:20pm (GMT) each day, a script written by [[User:Admin|Alloc]] dumps the wiki's external links table to [https://wiki.oni2.net/w/extlinks.csv this location]. ValExtLinks, which Iritscen runs on his computer periodically, walks through the exported table and looks for URLs that return problematic codes such as 404. It also detects other lesser problems with links. Val then makes suggestions for fixing these links and uploads its report in HTML, RTF and TXT formats to [http://iritscen.oni2.net/val/ this directory]. A wiki editor can then review the report and act accordingly.
Here's how the process works: twice a day (6:20am and 2:20pm GMT), a script written by [[User:Admin|Alloc]] dumps the wiki's external links table to [https://wiki.oni2.net/w/extlinks.csv this location]. ValExtLinks, which Iritscen runs on his computer periodically, walks through the exported table and looks for URLs that return problematic responses such as 301 and 404. It also detects other lesser problems with links. Val then makes suggestions for fixing these links and uploads its findings in HTML, RTF and TXT formats to [http://iritscen.oni2.net/val/ this directory]. Any wiki editor can then review the reports and act accordingly.


==How to fix link issues==
==Running and contributing==
Here are the codes that you'll see on problem links in the report.
The project is found [https://websvn.illy.bz/listing.php?repname=Oni2&path=%2FValidate+External+Links%2F HERE]. Along with the Bash shell script itself, you'll find documentation on how to run ValExtLinks on your own computer as well as resources for contributing to the code.
*'''NG''': In most cases, fixing an NG ("no good") link will mean finding the desired web page in the Internet Archive's [https://archive.org/web/ Wayback Machine] and linking to that archived page instead. In some cases, an NG link will not be salvageable and should be either removed from the page or, if the link was a part of a conversation and it would be confusing for it to be absent, it should be surrounded in nowiki tags [[Special:Diff/16377/26212|like this]] to prevent it from showing up in future reports.
 
**Val automatically queries the Archive for the latest snapshot of each NG page and will put the returned snapshot URL in its report. Note that you still have to verify this link by clicking on it, as it may not have the correct content. You may have to go further back in the Wayback Machine to find the proper snapshot to use. Sometimes the Archive simply never got around to archiving a given site. In that case, you will need to follow the advice above as to deleting the link or marking it with nowiki tags.
==Fixing link issues in a report==
**Note: In a typical run of Val across the 3,000+ links on the wiki, 1-3 sites will happen to be offline at the moment or the HTTP packets requesting them will get lost in the Internet. It's best to wait for another Val report to make sure that the URL is really dead before performing any of the above fixes.
Here are the codes that you'll see applied to problem links in the report.
*'''RD''': The site is redirecting the browser to a new page. The new page should be evaluated, and if it has the content we intended to link to then we should update the link to point to the new location. However, many redirects actually are "soft 404s" and simply redirect the browser to the site's main page. In this case, an RD link needs to be treated like an NG link (see above).
*'''NG''': In most cases, fixing an NG ("no good") link will mean finding the desired web page in the Internet Archive's [https://web.archive.org/ Wayback Machine] and linking to that archived copy instead. In some cases, an NG link will not be salvageable and should be either removed from the page or, if the link was a part of a conversation and it would be confusing for it to be absent, it should be surrounded in <code>nowiki</code> tags [[Special:Diff/40524|like this]] to prevent it from showing up in future reports.
*'''EI''': An "external internal" link, that is, a full URL for a page that is on our own wiki and which should simply be an [[Help:Editing#Intrawiki_links|intrawiki link]]. Sometimes an "external internal" may seem to be necessary but can be avoided with one of these special wiki features:
**Val automatically queries the Archive for the latest snapshot of each NG page and will put this snapshot URL in its report. Note that you should still check this snapshot to make sure it has the desired content. You may have to go further back in the Wayback Machine to find the proper snapshot to use. Sometimes the Archive simply never got around to archiving a given site. In that case, you will need to follow the advice above as to deleting the link or marking it with <code>nowiki</code> tags.
**If you want to link to a specific revision of a page, you might think you need a full URL [https://wiki.oni2.net/w/index.php?title=Oni&oldid=7685 like this one]. There's actually no need to link to any page at all, as the "ID" of a page revision like the one you see in that sample URL is unique wiki-wide. All you need to do is supply the revision ID to the Special:Permalink page like this — [[Special:Permalink/7685]] — and you're done.
**Note: In a typical run of ValExtLinks across the 3,500 links on the wiki, 1-3 sites will happen to be offline at the moment or the HTTP packets requesting them will get lost in cyberspace. Before attempting to perform any of the above fixes, try the link manually to make sure the site is really down. Even if it is, you might want to wait for next week's Val report to see if the site is permanently dead.
**If you need to link to a diff between two revisions of a page, or between two different pages, plug the old and new revision numbers into the Special:Diff page like this: [[Special:Diff/21491/21492]] (no need for page names, as explained above).
*'''RD''': The site is redirecting the browser to a new page. If the new page has the content we intended to link to, we should update the link to point to this new location. Be aware that some redirects are actually "soft 404s", redirecting the browser to the site's main page. In this case, an RD link needs to be treated like an NG link (see above).
**If there's no way around a bare URL, see "Exceptions" below to remove the link from the report.
*'''EI''': An "external internal" link, that is, a full URL for a page that is on our own wiki and which should simply be an [[Help:Editing#Intrawiki links|intrawiki link]]. Sometimes an "external internal" may seem to be necessary but can be avoided with one of these special wiki features:
*'''IW''': This marks an external link (bare URL) to another wiki which could be an [[Help:Editing#Interwiki_links|interwiki link]]. Interwiki links are shorter and more resistant to rot. The suggested interwiki link markup will be given in the report. For foreign-language Wikipedia pages, you can add a language code, e.g. <nowiki>[[wp:de:Test]]</nowiki> for the German version of the page.
**If you want to link to a specific revision of a page, you might think you need a full URL [https://wiki.oni2.net/w/index.php?title=Oni&oldid=7685 like this one]. There's actually no need to link to any page at all, as the "ID" of a page revision (which you will see in that sample URL) is unique wiki-wide. All you need to do is supply the revision ID to the Special:Permalink page like this — [[Special:Permalink/7685]] — and you're done.
*'''(xxx)''': The HTTP response code (see reference [[/HTTP codes|HERE]]).
**If you want to link to a specific revision as a diff from the previous revision of that page, plug the revision number into the Special:Diff page like this: [[Special:Diff/40550]] (no need for page names, as explained above). To link to a diff between two non-contiguous revisions of a page or between two different pages, plug the old and new revision numbers into the Special:Diff page like this: [[Special:Diff/21491/21492]].
*'''(000-xx)''': The response code from the Unix tool 'curl', for cases where it did not get an HTTP response code. See [[/Curl codes|HERE]] for a list of codes. The most common curl error by far is "000-28", a timeout from an unresponsive site.
**If there's no way around a bare URL, see the "Exceptions" section below to learn how to remove the link from the report.
*'''IW''': This marks an external link (bare URL) to another wiki which could be an [[Help:Editing#Interwiki links|interwiki link]]. Interwiki links are shorter and more resistant to rot. The suggested interwiki link markup will be given in the report. For foreign-language Wikipedia pages, you can add a language code, e.g. <nowiki>[[wp:de:Test]]</nowiki> for the German version of the page "Test".
*'''(xxx)''': The HTTP response code (see HTTP code reference [[/HTTP codes|HERE]]).
*'''(000-xx)''': The exit code from the Unix tool 'curl', for cases where it failed to get an HTTP response code (see 'curl' code reference [[/Curl codes|HERE]]). The most common 'curl' error by far is "000-28", a timeout from an unresponsive site.


===Exceptions===
===Exceptions===
Some links simply must be presented in an unconventional way which Val thinks is a problem. Some links return error codes but actually work fine. Such links can be added to the [[/Exceptions|exceptions list]] in order to hide them in future reports.
Some links simply must be presented in an unconventional way which Val thinks is a problem. Some links return error codes but actually work fine. Such links can be added to the [[/Exceptions|exceptions list]] in order to hide them in future reports.


In the summary at the bottom of the report, Val will list any exception that didn't have the intended effect because the link is no longer present on the listed page, or because it doesn't return that error code anymore. You can then edit the above exceptions list accordingly. Note that the HTML report only gives the number of issues detected, and the list of issues is found in the RTF and TXT versions of the report.
In the summary at the bottom of a ValExtLinks report, Val will list any exception that didn't have the intended effect because the link is no longer present on the listed page or because it doesn't return the expected error code anymore. You can then edit the wiki's exceptions list accordingly. Note that the HTML report only gives the number of exception issues detected, and the actual list of issues is found in the RTF and TXT versions of the report.
 
==Source code==
The project is found [http://websvn.chrilly.net/listing.php?repname=Oni2&path=%2FValidate+External+Links%2F HERE]. Along with the Bash script itself, you'll find documentation on how to run ValExtLinks on your own computer as well as resources for contributing to the code.


[[Category:Wiki Support]]
[[Category:Wiki Support]]

Latest revision as of 14:42, 24 March 2024

Developed by Iritscen, Validate External Links ("ValExtLinks" for short, or "Val" for even shorter) is a Bash shell script made to help fight link rot on OniGalore. The latest report on link issues is found HERE. Further link work which requires a bot is performed by ValBot.

Background

While MediaWiki makes it easy to find bad intrawiki links (links to nonexistent pages on our own wiki) by marking them in red and providing tools like Special:Wantedpages, there is no automatic check of external (outbound) links. MediaWiki compiles external links into a table, but it does not ping the URLs to see if they give any response. Over the years, many links on our wiki went dead as the Web changed and various file hosts went out of business. ValExtLinks has been used to fix thousands of link issues on OniGalore such as 404s and redirects.

Here's how the process works: twice a day (6:20am and 2:20pm GMT), a script written by Alloc dumps the wiki's external links table to this location. ValExtLinks, which Iritscen runs on his computer periodically, walks through the exported table and looks for URLs that return problematic responses such as 301 and 404. It also detects other lesser problems with links. Val then makes suggestions for fixing these links and uploads its findings in HTML, RTF and TXT formats to this directory. Any wiki editor can then review the reports and act accordingly.
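
As a rough illustration of the per-URL check described above, here is a minimal Bash sketch. It is not taken from the ValExtLinks source: the function name probe_url, the use of a HEAD request and the 10-second timeout are all assumptions made for this example. It prints the HTTP status code when the server answers, and otherwise falls back to curl's own exit code.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration only; the real ValExtLinks script may
# probe URLs differently.
# Prints the HTTP status code for a URL, or "000-<curl exit code>" when curl
# fails before getting an HTTP response (e.g. "000-28" for a timeout).
probe_url() {
  local url=$1 http_code curl_exit
  # --head asks for headers only; --write-out prints just the status code.
  # The 10-second limit is an arbitrary choice for this sketch.
  http_code=$(curl --silent --head --max-time 10 --output /dev/null \
                   --write-out '%{http_code}' "$url")
  curl_exit=$?
  if [ "$curl_exit" -ne 0 ]; then
    printf '000-%s\n' "$curl_exit"
  else
    printf '%s\n' "$http_code"
  fi
}

# Example call (needs network access):
#   probe_url "https://wiki.oni2.net/"
```

Some servers answer HEAD requests differently from GET requests, so a production checker may need to retry with a normal GET before declaring a link dead.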

Running and contributing

The project is found HERE. Along with the Bash shell script itself, you'll find documentation on how to run ValExtLinks on your own computer as well as resources for contributing to the code.

Fixing link issues in a report

Here are the codes that you'll see applied to problem links in the report.

  • NG: In most cases, fixing an NG ("no good") link means finding the desired web page in the Internet Archive's Wayback Machine and linking to that archived copy instead. In some cases, an NG link will not be salvageable. It should then either be removed from the page or, if it was part of a conversation and its absence would be confusing, wrapped in nowiki tags like this to keep it out of future reports.
    • Val automatically queries the Archive for the latest snapshot of each NG page and will put this snapshot URL in its report. Note that you should still check this snapshot to make sure it has the desired content. You may have to go further back in the Wayback Machine to find the proper snapshot to use. Sometimes the Archive simply never got around to archiving a given site. In that case, you will need to follow the advice above as to deleting the link or marking it with nowiki tags.
    • Note: In a typical run of ValExtLinks across the 3,500 links on the wiki, 1-3 sites will happen to be offline at that moment, or the HTTP requests to them will get lost in transit. Before performing any of the above fixes, try the link manually to make sure the site is really down. Even if it is, you might want to wait for next week's Val report to see if the site is permanently dead.
  • RD: The site is redirecting the browser to a new page. If the new page has the content we intended to link to, we should update the link to point to this new location. Be aware that some redirects are actually "soft 404s", redirecting the browser to the site's main page. In this case, an RD link needs to be treated like an NG link (see above).
  • EI: An "external internal" link, that is, a full URL for a page that is on our own wiki and which should simply be an intrawiki link. Sometimes an "external internal" may seem to be necessary but can be avoided with one of these special wiki features:
    • If you want to link to a specific revision of a page, you might think you need a full URL like this one. There's actually no need to link to any page at all, as the "ID" of a page revision (which you will see in that sample URL) is unique wiki-wide. All you need to do is supply the revision ID to the Special:Permalink page like this — Special:Permalink/7685 — and you're done.
    • If you want to link to a specific revision as a diff from the previous revision of that page, plug the revision number into the Special:Diff page like this: Special:Diff/40550 (no need for page names, as explained above). To link to a diff between two non-contiguous revisions of a page or between two different pages, plug the old and new revision numbers into the Special:Diff page like this: Special:Diff/21491/21492.
    • If there's no way around a bare URL, see the "Exceptions" section below to learn how to remove the link from the report.
  • IW: This marks an external link (bare URL) to another wiki which could be an interwiki link. Interwiki links are shorter and more resistant to rot. The suggested interwiki link markup will be given in the report. For foreign-language Wikipedia pages, you can add a language code, e.g. [[wp:de:Test]] for the German version of the page "Test".
  • (xxx): The HTTP response code (see HTTP code reference HERE).
  • (000-xx): The exit code from the Unix tool 'curl', for cases where it failed to get an HTTP response code (see 'curl' code reference HERE). The most common 'curl' error by far is "000-28", a timeout from an unresponsive site.
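
To make the relationship between response codes and report labels more concrete, here is a small Bash sketch. The mapping is an assumption made for illustration (2xx treated as fine, 3xx as RD, anything else, including curl failures in the "000-xx" form, as NG); the actual rules in ValExtLinks may be more nuanced, and the function name classify_response is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical mapping from a response code to a report label.
# Assumed rules for this sketch only:
#   2xx -> OK (no issue), 3xx -> RD (redirect), anything else -> NG (no good),
#   which includes curl failures written as "000-xx".
classify_response() {
  local code=$1
  case $code in
    2[0-9][0-9]) echo "OK" ;;
    3[0-9][0-9]) echo "RD" ;;
    *)           echo "NG" ;;
  esac
}

# Example calls:
#   classify_response 301
#   classify_response 000-28
```

A label alone does not decide the fix: as noted above, an RD link that lands on a site's main page still has to be handled like an NG link.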

Exceptions

Some links simply must be presented in an unconventional way which Val thinks is a problem. Some links return error codes but actually work fine. Such links can be added to the exceptions list in order to hide them in future reports.

In the summary at the bottom of a ValExtLinks report, Val will list any exception that didn't have the intended effect because the link is no longer present on the listed page or because it doesn't return the expected error code anymore. You can then edit the wiki's exceptions list accordingly. Note that the HTML report only gives the number of exception issues detected, and the actual list of issues is found in the RTF and TXT versions of the report.