
Trandoshan dark web crawler


This repository is a complete rewrite of the Trandoshan dark web crawler. Everything has been written inside a single Git repository to ease maintenance.

Why a rewrite?

The first version of Trandoshan (available here) works great, but it is not really professional: the code started to become a mess and was hard to manage, since it was split across multiple repositories.

I have therefore decided to create & maintain the project in this single repository, where the code of every process is available (as a Go module).

How to build the crawler

Since the Docker images are not available yet, one must run the following script in order to fully build the crawler.

./scripts/build.sh

How to start the crawler

Execute ./scripts/start.sh and wait for all containers to start. You can start the crawler in detached mode by passing --detach to start.sh.

Note

Ensure you have at least 3GB of free memory, as the dockerized Elasticsearch stack alone requires 2GB.
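On Linux you can quickly verify how much memory the host has before starting the stack, for example:

```shell
# Print total host memory (Linux); the Elasticsearch stack alone needs ~2GB,
# so at least 3GB should be free before running start.sh
grep MemTotal /proc/meminfo
```

On macOS or Windows, where containers run inside a VM, make sure the Docker Desktop memory limit is raised accordingly in its resource settings.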

How to start the crawling process

Since the API is exposed on localhost:15005, one can use it to start the crawling process:

using trandoshanctl executable:

trandoshanctl schedule https://www.facebookcorewwwi.onion

or using the Docker image:

docker run creekorful/trandoshanctl schedule https://www.facebookcorewwwi.onion

This will schedule the given URL for crawling.
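Since trandoshanctl itself goes through the HTTP API, anything that speaks HTTP can schedule a URL the same way. The route (/v1/urls) and the raw JSON string payload below are illustrative assumptions only; the real contract is defined in the api package:

```shell
# Hypothetical sketch: the /v1/urls route and the JSON payload shape are
# assumptions, not the documented API; check the api package for the real ones.
curl -X POST 'http://localhost:15005/v1/urls' \
  -H 'Content-Type: application/json' \
  -d '"https://www.facebookcorewwwi.onion"'
```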

How to view results

Using trandoshanctl

trandoshanctl search <term>

Using Kibana

You can use the Kibana dashboard available at http://localhost:15004. You will need to create an index pattern named 'resources', and when it asks for the time field, choose 'time'.