# Trandoshan dark web crawler
This repository is a complete rewrite of the Trandoshan dark web crawler. Everything now lives in a single Git repository to ease maintenance.
## Why a rewrite?

The first version of Trandoshan (available here) works fine, but it is not very professional: the code has become messy and is hard to manage since it is split across multiple repositories.
I have therefore decided to create and maintain the project in this single repository, where all process code will be available (as a Go module).
## How to build the crawler

Since the Docker images are not available yet, you must run the following script to build the crawler fully:

```sh
./scripts/build.sh
```
## How to start the crawler

Execute the start script and wait for all containers to start:

```sh
./scripts/start.sh
```
### Note

Ensure you have at least 3GB of free memory, as the Elasticsearch stack alone requires 2GB.
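On Linux hosts, Elasticsearch running inside Docker also typically needs the kernel's `vm.max_map_count` setting raised. This is a host-level kernel setting, not part of Trandoshan's scripts; if the Elasticsearch container exits shortly after startup, it is worth checking:

```sh
# Elasticsearch in Docker requires vm.max_map_count >= 262144 on Linux.
# This changes a host kernel parameter and needs root privileges.
sudo sysctl -w vm.max_map_count=262144
```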
## How to start the crawling process

Since the API is exposed on localhost:15005, you can use it to start the crawling process:

```sh
feeder --api-uri http://localhost:15005 --url https://www.facebookcorewwwi.onion
```

This will make the API publish the given URL to the crawling queue.
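Under the hood, the feeder boils down to a single HTTP request against the API. The sketch below illustrates the idea only: the `/url` endpoint path and the plain-text body are assumptions, not the real request format, so check the feeder's source under `cmd` before relying on it.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// buildFeedRequest builds the HTTP request a feeder-like client would
// send to the API. The "/url" endpoint and text/plain body are
// hypothetical; the real API may differ.
func buildFeedRequest(apiURI, target string) (*http.Request, error) {
	body := bytes.NewBufferString(target)
	req, err := http.NewRequest(http.MethodPost, apiURI+"/url", body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "text/plain")
	return req, nil
}

func main() {
	req, err := buildFeedRequest("http://localhost:15005", "https://www.facebookcorewwwi.onion")
	if err != nil {
		panic(err)
	}
	// Print what would be sent (the request is not actually executed here).
	fmt.Println(req.Method, req.URL.String()) // POST http://localhost:15005/url
}
```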
## How to access the Kibana UI

Head over to http://localhost:15004

You will need to create an index pattern named 'resources'; when asked for the time field, choose 'time'.