Update README.md (#182)

scraper to dump-generator

---------

Co-authored-by: Elsie Hupp <github@elsiehupp.com>
pull/475/head
Rob Kam 9 months ago committed by GitHub
parent 625481b7b8
commit 38cf49dae0
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -158,7 +158,7 @@ pip uninstall wikiteam3
```
```bash
rm -fr [cloned_mediawiki_scraper_folder]
rm -fr [cloned mediawiki dump generator folder]
```
### 4. Updating MediaWiki Dump Generator
@ -291,9 +291,34 @@ In the above example, `--path` is only necessary if the download path is not the
`dumpgenerator` will also ask you if you want to resume if it finds an incomplete dump in the path where it is downloading.
## Checking dump integrity
If you want to check the XML dump integrity, type this into your command line to count title, page and revision XML tags:
```bash
grep -E '<title(.*?)>' *.xml -c;grep -E '<page(.*?)>' *.xml -c;grep \
"</page>" *.xml -c;grep -E '<revision(.*?)>' *.xml -c;grep "</revision>" *.xml -c
```
You should see something similar to this (not the actual numbers) - the first three numbers should be the same and the last two should be the same as each other:
```bash
580
580
580
5677
5677
```
If the first three numbers differ from each other, or the last two differ from each other, your XML dump is corrupt (it contains one or more unclosed ```<page>``` or ```<revision>``` elements). This is uncommon for small wikis, but dumps of large or very large wikis can be truncated while pages are exported and merged. The solution is to remove the XML dump and download it again; this is tedious, and the new download can fail in the same way.
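The comparison described above can also be scripted. The following is a minimal sketch, not part of `dumpgenerator`: the function name `check_dump` is hypothetical, and it simply reuses the same `grep` patterns as the command above, so it counts matching lines rather than individual tags.

```bash
#!/usr/bin/env bash
# check_dump is a hypothetical helper, not shipped with this project.
# It counts the same tags as the manual grep command and compares them.
check_dump() {
    local file="$1"
    local titles pages pages_closed revisions revisions_closed

    if [ ! -r "$file" ]; then
        echo "Cannot read: $file" >&2
        return 2
    fi

    # grep -c prints the number of matching lines (0 if there are none).
    titles=$(grep -c -E '<title(.*?)>' "$file")
    pages=$(grep -c -E '<page(.*?)>' "$file")
    pages_closed=$(grep -c '</page>' "$file")
    revisions=$(grep -c -E '<revision(.*?)>' "$file")
    revisions_closed=$(grep -c '</revision>' "$file")

    # The first three counts must agree, and so must the last two.
    if [ "$titles" -eq "$pages" ] && [ "$pages" -eq "$pages_closed" ] \
        && [ "$revisions" -eq "$revisions_closed" ]; then
        echo "OK: $file"
        return 0
    fi

    echo "MISMATCH: $file (title=$titles page=$pages /page=$pages_closed revision=$revisions /revision=$revisions_closed)"
    return 1
}
```

After sourcing the function, something like `for f in *.xml; do check_dump "$f"; done` would check every dump in the current directory. A return status of 1 means the counts disagree and the dump should be downloaded again.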
## Publishing the dump
Please consider publishing your wiki dump(s). You can do it yourself as explained at WikiTeam's [Publishing the dump](https://github.com/WikiTeam/wikiteam/wiki/Tutorial#Publishing_the_dump) tutorial.
## Using `launcher`
`launcher` is a way to download a large list of wikis with a single invocation.
`launcher` is a way to download a list of wikis with a single invocation.
Usage:
@ -315,7 +340,7 @@ The `--generator-arg` argument can be used to pass through arguments to the `gen
## Using `uploader`
`uploader` is a way to upload a large set of already-generated wiki dumps to the Internet Archive with a single invocation.
`uploader` is a way to upload a set of already-generated wiki dumps to the Internet Archive with a single invocation.
Usage:
@ -338,31 +363,6 @@ Named arguments (short and long versions):
* `-kf`, `--keysfile`: Path to a file containing Internet Archive API keys. Should contain two lines: the access key, then the secret key. Defaults to `./keys.txt`.
* `-lf`, `--logfile`: Where to store a log of uploaded files (to reduce duplicate work). Defaults to `uploader-X.txt`, where `X` is the final part of the `listfile` path.
## Checking dump integrity
If you want to check the XML dump integrity, type this into your command line to count title, page and revision XML tags:
```bash
grep -E '<title(.*?)>' *.xml -c;grep -E '<page(.*?)>' *.xml -c;grep \
"</page>" *.xml -c;grep -E '<revision(.*?)>' *.xml -c;grep "</revision>" *.xml -c
```
You should see something similar to this (not the actual numbers) - the first three numbers should be the same and the last two should be the same as each other:
```bash
580
580
580
5677
5677
```
If the first three numbers differ from each other, or the last two differ from each other, your XML dump is corrupt (it contains one or more unclosed ```<page>``` or ```<revision>``` elements). This is uncommon for small wikis, but dumps of large or very large wikis can be truncated while pages are exported and merged. The solution is to remove the XML dump and download it again; this is tedious, and the new download can fail in the same way.
## Publishing the dump
Please consider publishing your wiki dump(s). You can do it yourself as explained at WikiTeam's [Publishing the dump](https://github.com/WikiTeam/wikiteam/wiki/Tutorial#Publishing_the_dump) tutorial.
## Getting help
* You can read and post in MediaWiki Client Tools' [GitHub Discussions](https://github.com/orgs/mediawiki-client-tools/discussions).
@ -374,7 +374,7 @@ For information on reporting bugs and proposing changes, please see the [Contrib
## Code of Conduct
`mediawiki-client-tools` is currently working to implement the [Contributor Covenant](https://www.contributor-covenant.org), and you can read our [Code of Conduct](./CODE_OF_CONDUCT.md) based on this.
`mediawiki-client-tools` has a [Code of Conduct](./CODE_OF_CONDUCT.md).
At the moment the only person responsible for reviewing CoC reports is the repository administrator, Elsie Hupp, but we will work towards implementing a broader-based approach to reviews.
