mirror of https://github.com/creekorful/bathyscaphe synced 2024-11-19 15:25:44 +00:00

Aloïs Micard afdeb6d6e2 Fix log.sh (oops)		2020-09-24 09:43:09 +02:00
.github/workflows	Cleanup code	2020-08-10 08:01:50 +02:00
api	Lint source code	2020-09-23 17:19:18 +02:00
build/docker	Finalize whole implementation	2020-09-22 12:10:04 +02:00
cmd	Finalize whole implementation	2020-09-22 12:10:04 +02:00
deployments/docker	Start implementing new architecture	2020-09-21 16:40:12 +02:00
docs	Start implementing new architecture	2020-09-21 16:40:12 +02:00
internal	API contract: now pagine-able!	2020-09-23 17:17:19 +02:00
scripts	Fix log.sh (oops)	2020-09-24 09:43:09 +02:00
.dockerignore	Start implementing new architecture	2020-09-21 16:40:12 +02:00
.gitignore	Initial commit	2020-04-03 17:43:59 +02:00
go.mod	[#12 ] Allow duplicate resource crawling	2020-09-22 17:54:30 +02:00
go.sum	[#12 ] Allow duplicate resource crawling	2020-09-22 17:54:30 +02:00
LICENSE	Initial commit	2020-04-03 17:43:59 +02:00
README.md	Last update to README.md	2020-09-24 08:23:22 +02:00
snapcraft.yaml	Release 0.2.0	2020-08-15 17:11:30 +02:00

README.md

Trandoshan dark web crawler

This repository is a complete rewrite of the Trandoshan dark web crawler. Everything has been written inside a single Git repository to ease maintenance.

Why a rewrite?

The first version of Trandoshan (available here) works well but is not really professional: the code is becoming a mess and is hard to manage because it is split across multiple repositories.

I have therefore decided to create and maintain the project in this single repository, where the code for all processes will be available (as a Go module).

How to start the crawler

To start the crawler, run the following command:

$ ./scripts/start.sh

and wait for all containers to start.

Notes

You can start the crawler in detached mode by passing --detach to start.sh.
Ensure you have at least 3 GB of memory, as the Elasticsearch Docker stack will require 2 GB.
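
The memory note above can be checked before starting the stack. The following is a minimal sketch, not part of the repository: the function name has_enough_memory is hypothetical, and it assumes a Linux host that exposes /proc/meminfo. The comparison is strict, so a machine sold as "3 GB" may report slightly less and fail the check.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the project).
# Usage: has_enough_memory <required GiB> [meminfo file]
# Returns 0 when MemTotal in the meminfo-style file is at least the required
# amount, 1 otherwise (including when MemTotal cannot be read).
has_enough_memory() {
    local required_gb="$1"
    local meminfo="${2:-/proc/meminfo}"
    local total_kb

    # MemTotal is reported in kB, e.g. "MemTotal:  4194304 kB".
    total_kb=$(awk '/^MemTotal:/ {print $2}' "$meminfo")

    [ -n "$total_kb" ] && [ "$total_kb" -ge $((required_gb * 1024 * 1024)) ]
}
```

For example, `has_enough_memory 3 && ./scripts/start.sh` would only start the crawler when the check passes.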

How to initiate crawling

Since the API is exposed on localhost:15005, one can use it to start the crawling process:

Using the trandoshanctl executable:

$ trandoshanctl schedule https://www.facebookcorewwwi.onion

or using the docker image:

$ docker run creekorful/trandoshanctl --api-uri <uri> schedule https://www.facebookcorewwwi.onion

(You will need to specify the API URI if you use the Docker container.)

This will schedule the given URL for crawling.
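
To schedule several seed URLs at once, the schedule command can be wrapped in a small shell loop. This is a minimal sketch under stated assumptions: the function name schedule_from_file and the input file are hypothetical, and it assumes trandoshanctl is on the PATH and can reach the API without extra flags.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the project).
# Usage: schedule_from_file <file with one URL per line>
# Blank lines and lines starting with '#' are skipped.
schedule_from_file() {
    local file="$1"
    local url

    # '|| [ -n "$url" ]' also handles a last line without a trailing newline.
    while IFS= read -r url || [ -n "$url" ]; do
        case "$url" in
            ''|'#'*) continue ;;
        esac
        # stdin is redirected so the command cannot consume the URL list.
        trandoshanctl schedule "$url" </dev/null \
            || echo "failed to schedule: $url" >&2
    done < "$file"
}
```

Calling `schedule_from_file seeds.txt` (where seeds.txt is a file you create) schedules each listed URL in turn and reports failures on stderr without stopping the loop.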

How to speed up crawling

To speed up crawling, you can scale the number of crawler instances. Run the following command once the crawler has started:

$ ./scripts/scale.sh crawler=5

This will set the number of crawler instances to 5.

How to view results

Using trandoshanctl

$ trandoshanctl search <term>

Using Kibana

You can use the Kibana dashboard available at http://localhost:15004. You will need to create an index pattern named 'resources', and when it asks for the time field, choose 'time'.

How to hack the crawler

If you have changed one of the crawler processes and want start.sh to use the updated version, run the following command:

$ ./scripts/build.sh

This will rebuild all crawler images using your local changes. After that, run start.sh again to have the updated version running.
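
The rebuild-then-restart cycle can be combined into one step. This is a minimal sketch: the function name rebuild_and_restart is hypothetical, and it assumes it is called from the repository root and that both scripts return a non-zero exit status on failure.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the project).
# Rebuilds the images, then starts the crawler only if the build succeeded.
# Any arguments (for example --detach) are forwarded to start.sh.
rebuild_and_restart() {
    ./scripts/build.sh && ./scripts/start.sh "$@"
}
```

For example, `rebuild_and_restart --detach` rebuilds the images and then starts the stack in detached mode.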