Merge pull request #1 from benbusby/main

Update to keep up with source repo
11 months ago · 678557abae
parent db9b26d89d 63a2ea56ed
commit 678557abae
11 changed files with 109 additions and 39 deletions
--- a/README.md
+++ b/README.md
@ -14,7 +14,7 @@
  </tr>
 </table>

-Get Google search results, but without any ads, javascript, AMP links, cookies, or IP address tracking. Easily deployable in one click as a Docker app, and customizable with a single config file. Quick and simple to implement as a primary search engine replacement on both desktop and mobile.
+Get Google search results, but without any ads, JavaScript, AMP links, cookies, or IP address tracking. Easily deployable in one click as a Docker app, and customizable with a single config file. Quick and simple to implement as a primary search engine replacement on both desktop and mobile.

 Contents
 1. [Features](#features)
@ -33,10 +33,11 @@ Contents
 5. [Usage](#usage)
 6. [Extra Steps](#extra-steps)
    1. [Set Primary Search Engine](#set-whoogle-as-your-primary-search-engine)
-    2. [Prevent Downtime (Heroku Only)](#prevent-downtime-heroku-only)
-    3. [Manual HTTPS Enforcement](#https-enforcement)
-    4. [Using with Firefox Containers](#using-with-firefox-containers)
-    5. [Reverse Proxying](#reverse-proxying)
+    2. [Custom Redirecting](#custom-redirecting)
+    3. [Prevent Downtime (Heroku Only)](#prevent-downtime-heroku-only)
+    4. [Manual HTTPS Enforcement](#https-enforcement)
+    5. [Using with Firefox Containers](#using-with-firefox-containers)
+    6. [Reverse Proxying](#reverse-proxying)
        1. [Nginx](#nginx)
 7. [Contributing](#contributing)
 8. [FAQ](#faq)
@ -95,7 +96,7 @@ Provides:
 - Free deployment of app
 - Free HTTPS url (https://\<app name\>.\<username\>\.repl\.co)
    - Supports custom domains
-- Downtime after periods of inactivity \([solution 1](https://repl.it/talk/ask/use-this-pingmat1replco-just-enter/28821/101298), [solution 2](https://repl.it/talk/learn/How-to-use-and-setup-UptimeRobot/9003)\)
+- Downtime after periods of inactivity ([solution](https://repl.it/talk/learn/How-to-use-and-setup-UptimeRobot/9003))

 ___

@ -398,6 +399,10 @@ There are a few optional environment variables available for customizing a Whoog
 | WHOOGLE_PROXY_PASS   | The password of the proxy server.                                                         |
 | WHOOGLE_PROXY_TYPE   | The type of the proxy server. Can be "socks5", "socks4", or "http".                       |
 | WHOOGLE_PROXY_LOC    | The location of the proxy server (host or ip).                                            |
+| WHOOGLE_USER_AGENT   | The desktop user agent to use. Defaults to a randomly generated one.                      |
+| WHOOGLE_USER_AGENT_MOBILE | The mobile user agent to use. Defaults to a randomly generated one.                  |
+| WHOOGLE_USE_CLIENT_USER_AGENT | Set to 1 to use the client's own user agent for all requests. Defaults to 0 (disabled). |
+| WHOOGLE_REDIRECTS    | Specify sites that should be redirected elsewhere. See [custom redirecting](#custom-redirecting). |
 | EXPOSE_PORT          | The port where Whoogle will be exposed.                                                   |
 | HTTPS_ONLY           | Enforce HTTPS. (See [here](https://github.com/benbusby/whoogle-search#https-enforcement)) |
 | WHOOGLE_ALT_TW       | The twitter.com alternative to use when site alternatives are enabled in the config. Set to "" to disable. |
@ -406,7 +411,7 @@ There are a few optional environment variables available for customizing a Whoog
 | WHOOGLE_ALT_TL       | The Google Translate alternative to use. This is used for all "translate ____" searches.  Set to "" to disable. |
 | WHOOGLE_ALT_MD       | The medium.com alternative to use when site alternatives are enabled in the config. Set to "" to disable. |
 | WHOOGLE_ALT_IMG      | The imgur.com alternative to use when site alternatives are enabled in the config. Set to "" to disable. |
-| WHOOGLE_ALT_WIKI     | The wikipedia.com alternative to use when site alternatives are enabled in the config. Set to "" to disable. |
+| WHOOGLE_ALT_WIKI     | The wikipedia.org alternative to use when site alternatives are enabled in the config. Set to "" to disable. |
 | WHOOGLE_ALT_IMDB     | The imdb.com alternative to use when site alternatives are enabled in the config. Set to "" to disable.  |
 | WHOOGLE_ALT_QUORA    | The quora.com alternative to use when site alternatives are enabled in the config. Set to "" to disable. |
 | WHOOGLE_AUTOCOMPLETE | Controls visibility of autocomplete/search suggestions. Default on -- use '0' to disable. |
@ -448,6 +453,7 @@ Same as most search engines, with the exception of filtering by time range.
 To filter by a range of time, append ":past <time>" to the end of your search, where <time> can be `hour`, `day`, `month`, or `year`. Example: `coronavirus updates :past hour`

 ## Extra Steps
+
 ### Set Whoogle as your primary search engine
 *Note: If you're using a reverse proxy to run Whoogle Search, make sure the "Root URL" config option on the home page is set to your URL before going through these steps.*

@ -492,6 +498,32 @@ Browser settings:
    - Manual
      - Under search engines > manage search engines > add, manually enter your Whoogle instance details with a `<whoogle url>/search?q=%s` formatted search URL.

+### Custom Redirecting
+You can set custom site redirects using the `WHOOGLE_REDIRECTS` environment
+variable. Many sites, such as Twitter and Reddit, have built-in redirects
+to [Farside links](https://sr.ht/~benbusby/farside), but you may want to define
+your own.
+
+To do this, you can use the following syntax:
+
+```
+WHOOGLE_REDIRECTS="<parent_domain>:<new_domain>"
+```
+
+For example, if you want to redirect from "badsite.com" to "goodsite.com":
+
+```
+WHOOGLE_REDIRECTS="badsite.com:goodsite.com"
+```
+
+This can be used for multiple sites as well, with comma separation:
+
+```
+WHOOGLE_REDIRECTS="badA.com:goodA.com,badB.com:goodB.com"
+```
+
+NOTE: Do not include "http(s)://" when defining your redirect.
+
 ### Prevent Downtime (Heroku only)
 Part of the deal with Heroku's free tier is that you're allocated 550 hours/month (meaning it can't stay active 24/7), and the app is temporarily shut down after 30 minutes of inactivity. Once it becomes inactive, any Whoogle searches will still work, but it'll take an extra 10-15 seconds for the app to come back online before displaying the result, which can be frustrating if you're in a hurry.

@ -568,7 +600,7 @@ Under the hood, Whoogle is a basic Flask app with the following structure:
    - `opensearch.xml`: A template used for supporting [OpenSearch](https://developer.mozilla.org/en-US/docs/Web/OpenSearch).
    - `imageresults.html`: An "experimental" template used for supporting the "Full Size" image feature on desktop.
  - `static/<css|js>`
-    - CSS/Javascript files, should be self-explanatory
+    - CSS/JavaScript files, should be self-explanatory
  - `static/settings`
    - Key-value JSON files for establishing valid configuration values

@ -607,7 +639,7 @@ I'm a huge fan of Searx though and encourage anyone to use that instead if they

 **Why does the image results page look different?**

-A lot of the app currently piggybacks on Google's existing support for fetching results pages with Javascript disabled. To their credit, they've done an excellent job with styling pages, but it seems that the image results page - particularly on mobile - is a little rough. Moving forward, with enough interest, I'd like to transition to fetching the results and parsing them into a unique Whoogle-fied interface that I can style myself.
+A lot of the app currently piggybacks on Google's existing support for fetching results pages with JavaScript disabled. To their credit, they've done an excellent job with styling pages, but it seems that the image results page - particularly on mobile - is a little rough. Moving forward, with enough interest, I'd like to transition to fetching the results and parsing them into a unique Whoogle-fied interface that I can style myself.

 ## Public Instances

@ -621,16 +653,17 @@ A lot of the app currently piggybacks on Google's existing support for fetching
 | [https://s.tokhmi.xyz](https://s.tokhmi.xyz) | 🇺🇸 US | Multi-choice | ✅ |
 | [https://search.sethforprivacy.com](https://search.sethforprivacy.com) | 🇩🇪 DE | English | |
 | [https://whoogle.dcs0.hu](https://whoogle.dcs0.hu) | 🇭🇺 HU | Multi-choice | |
-| [https://whoogle.esmailelbob.xyz](https://whoogle.esmailelbob.xyz) | 🇨🇦 CA | Multi-choice | |
 | [https://gowogle.voring.me](https://gowogle.voring.me) | 🇺🇸 US | Multi-choice | |
-| [https://whoogle.privacydev.net](https://whoogle.privacydev.net) | 🇳🇱 NL | English | |
+| [https://whoogle.privacydev.net](https://whoogle.privacydev.net) | 🇩🇪 DE | English | |
 | [https://wg.vern.cc](https://wg.vern.cc) | 🇺🇸 US | English |  |
 | [https://whoogle.hxvy0.gq](https://whoogle.hxvy0.gq) | 🇨🇦 CA | Turkish Only | ✅ |
 | [https://whoogle.hostux.net](https://whoogle.hostux.net) | 🇫🇷 FR | Multi-choice | |
 | [https://whoogle.lunar.icu](https://whoogle.lunar.icu) | 🇩🇪 DE | Multi-choice | ✅ |
 | [https://wgl.frail.duckdns.org](https://wgl.frail.duckdns.org) | 🇧🇷 BR | Multi-choice | |
-| [https://whoogle.no-logs.com/)(https://whoogle.no-logs.com/) | 🇸🇪 SE | Multi-choice | |
+| [https://whoogle.no-logs.com](https://whoogle.no-logs.com/) | 🇸🇪 SE | Multi-choice | |
 | [https://search.rubberverse.xyz](https://search.rubberverse.xyz) | 🇵🇱 PL | English | |
+| [https://whoogle.ftw.lol](https://whoogle.ftw.lol) | 🇩🇪 DE | Multi-choice | |
+| [https://whoogle-search--replitcomreside.repl.co](https://whoogle-search--replitcomreside.repl.co) | 🇺🇸 US | English |  |


 * A checkmark in the "Cloudflare" category here refers to the use of the reverse proxy, [Cloudflare](https://cloudflare.com). The checkmark will not be listed for a site which uses Cloudflare DNS but rather the proxying service which grants Cloudflare the ability to monitor traffic to the website.
@ -642,7 +675,7 @@ A lot of the app currently piggybacks on Google's existing support for fetching
 | [http://whoglqjdkgt2an4tdepberwqz3hk7tjo4kqgdnuj77rt7nshw2xqhqad.onion](http://whoglqjdkgt2an4tdepberwqz3hk7tjo4kqgdnuj77rt7nshw2xqhqad.onion) | 🇺🇸 US |  Multi-choice
 | [http://nuifgsnbb2mcyza74o7illtqmuaqbwu4flam3cdmsrnudwcmkqur37qd.onion](http://nuifgsnbb2mcyza74o7illtqmuaqbwu4flam3cdmsrnudwcmkqur37qd.onion) | 🇩🇪 DE |  English
 | [http://whoogle.vernccvbvyi5qhfzyqengccj7lkove6bjot2xhh5kajhwvidqafczrad.onion](http://whoogle.vernccvbvyi5qhfzyqengccj7lkove6bjot2xhh5kajhwvidqafczrad.onion/) | 🇺🇸 US | English |
-| [http://whoogle.g4c3eya4clenolymqbpgwz3q3tawoxw56yhzk4vugqrl6dtu3ejvhjid.onion](http://whoogle.g4c3eya4clenolymqbpgwz3q3tawoxw56yhzk4vugqrl6dtu3ejvhjid.onion/) | 🇳🇱 NL | English |
+| [http://whoogle.g4c3eya4clenolymqbpgwz3q3tawoxw56yhzk4vugqrl6dtu3ejvhjid.onion](http://whoogle.g4c3eya4clenolymqbpgwz3q3tawoxw56yhzk4vugqrl6dtu3ejvhjid.onion/) | 🇩🇪 DE | English |

 #### I2P Instances

--- a/app/filter.py
+++ b/app/filter.py
@ -561,18 +561,19 @@ class Filter:
        is enabled
        """
        for site, alt in SITE_ALTS.items():
-            for div in self.soup.find_all('div', text=re.compile(site)):
-                # Use the number of words in the div string to determine if the
-                # string is a result description (shouldn't replace domains used
-                # in desc text).
-                # Also ignore medium.com replacements since these are handled
+            if site != "medium.com" and alt != "":
+                # Ignore medium.com replacements since these are handled
                # specifically in the link description replacement, and medium
                # results are never given their own "card" result where this
                # replacement would make sense.
-                if site == 'medium.com' or len(div.string.split(' ')) > 1:
-                    continue
-
-                div.string = div.string.replace(site, alt)
+                # Also ignore if the alt is empty, since this is used to indicate
+                # that the alt is not enabled.
+                for div in self.soup.find_all('div', text=re.compile(site)):
+                    # Use the number of words in the div string to determine if the
+                    # string is a result description (shouldn't replace domains used
+                    # in desc text).
+                    if len(div.string.split(' ')) == 1:
+                        div.string = div.string.replace(site, alt)

            for link in self.soup.find_all('a', href=True):
                # Search and replace all link descriptions
@ -596,7 +597,7 @@ class Filter:
                # replaced (i.e. 'philomedium.com' should stay as it is).
                if 'medium.com' in link_str:
                    if link_str.startswith('medium.com') or '.medium.com' in link_str:
-                        link_str = 'farside.link/scribe' + link_str[
+                        link_str = SITE_ALTS['medium.com'] + link_str[
                            link_str.find('medium.com') + len('medium.com'):]
                    new_desc.string = link_str
                else:
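
The medium.com handling in this hunk can be sketched in isolation as below. This is a minimal illustration, not the project's code: `rewrite_medium_link` is a hypothetical helper name, and the `alt` argument stands in for `SITE_ALTS['medium.com']`.

```python
def rewrite_medium_link(link_str: str, alt: str) -> str:
    """Swap medium.com for `alt` only when it is the actual domain.

    Hypothetical helper mirroring the condition in the hunk above.
    """
    if 'medium.com' not in link_str:
        return link_str
    # Only replace a bare medium.com or a subdomain of it, so that
    # lookalikes such as 'philomedium.com' stay untouched.
    if link_str.startswith('medium.com') or '.medium.com' in link_str:
        tail = link_str[link_str.find('medium.com') + len('medium.com'):]
        return alt + tail
    return link_str


print(rewrite_medium_link('medium.com/@user/post', 'farside.link/scribe'))
print(rewrite_medium_link('philomedium.com/post', 'farside.link/scribe'))
```

Note that a subdomain prefix (for example `blog.medium.com`) is dropped by the slice, since everything before the matched domain is discarded.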
--- a/app/models/config.py
+++ b/app/models/config.py
@ -254,7 +254,8 @@ class Config:
                key = self._get_fernet_key(self.preferences_key)

                config = Fernet(key).decrypt(
-                    brotli.decompress(urlsafe_b64decode(preferences.encode()))
+                    brotli.decompress(urlsafe_b64decode(
+                        preferences.encode() + b'=='))
                )

                config = pickle.loads(brotli.decompress(config))
@ -262,7 +263,8 @@ class Config:
                config = {}
        elif mode == 'u': # preferences are not encrypted
            config = pickle.loads(
-                brotli.decompress(urlsafe_b64decode(preferences.encode()))
+                brotli.decompress(urlsafe_b64decode(
+                    preferences.encode() + b'=='))
            )
        else: # preferences are incorrectly formatted
            config = {}
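
The `+ b'=='` change above appears intended to tolerate preference strings whose base64 padding was stripped. A minimal sketch of that idea, using only the standard library; `decode_unpadded` is a hypothetical name, and the brotli/Fernet steps from the real code are left out:

```python
from base64 import urlsafe_b64decode, urlsafe_b64encode


def decode_unpadded(token: str) -> bytes:
    """Decode URL-safe base64 whose trailing '=' padding may be missing."""
    # Appending '==' supplies enough padding for any valid input length;
    # Python's default (non-strict) decoder ignores padding it doesn't need.
    return urlsafe_b64decode(token.encode() + b'==')


payload = b'whoogle prefs'
stripped = urlsafe_b64encode(payload).decode().rstrip('=')
assert decode_unpadded(stripped) == payload
```

Without the extra padding, `urlsafe_b64decode` raises `binascii.Error` ("Incorrect padding") for inputs whose length is not a multiple of four.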
--- a/app/request.py
+++ b/app/request.py
@ -73,6 +73,14 @@ def send_tor_signal(signal: Signal) -> bool:


 def gen_user_agent(is_mobile) -> str:
+    user_agent = os.environ.get('WHOOGLE_USER_AGENT', '')
+    user_agent_mobile = os.environ.get('WHOOGLE_USER_AGENT_MOBILE', '')
+    if user_agent and not is_mobile:
+        return user_agent
+
+    if user_agent_mobile and is_mobile:
+        return user_agent_mobile
+
    firefox = random.choice(['Choir', 'Squier', 'Higher', 'Wire']) + 'fox'
    linux = random.choice(['Win', 'Sin', 'Gin', 'Fin', 'Kin']) + 'ux'

@ -261,7 +269,7 @@ class Request:
            return []

    def send(self, base_url='', query='', attempt=0,
-             force_mobile=False) -> Response:
+             force_mobile=False, user_agent='') -> Response:
        """Sends an outbound request to a URL. Optionally sends the request
        using Tor, if enabled by the user.

@ -277,10 +285,14 @@ class Request:
            Response: The Response object returned by the requests call

        """
-        if force_mobile and not self.mobile:
-            modified_user_agent = self.modified_user_agent_mobile
+        use_client_user_agent = int(os.environ.get('WHOOGLE_USE_CLIENT_USER_AGENT', '0'))
+        if user_agent and use_client_user_agent == 1:
+            modified_user_agent = user_agent
        else:
-            modified_user_agent = self.modified_user_agent
+            if force_mobile and not self.mobile:
+                modified_user_agent = self.modified_user_agent_mobile
+            else:
+                modified_user_agent = self.modified_user_agent

        headers = {
            'User-Agent': modified_user_agent
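
The user agent precedence introduced in `send()` can be summarized with a small standalone function. This is a sketch under assumptions: `pick_user_agent` and its parameter names are hypothetical, with `session_ua` and `mobile_ua` standing in for `self.modified_user_agent` and `self.modified_user_agent_mobile`.

```python
def pick_user_agent(client_ua: str, use_client_ua: bool,
                    force_mobile: bool, is_mobile: bool,
                    session_ua: str, mobile_ua: str) -> str:
    """Return the user agent for an outbound request (illustrative only)."""
    # 1. The client's own user agent wins, but only when opted in.
    if client_ua and use_client_ua:
        return client_ua
    # 2. Otherwise, force the mobile agent for desktop sessions if asked.
    if force_mobile and not is_mobile:
        return mobile_ua
    # 3. Fall back to the agent already chosen for this session.
    return session_ua


print(pick_user_agent('MyBrowser/1.0', True, False, False, 'desk', 'mob'))
```

In the commit itself, the opt-in flag comes from `WHOOGLE_USE_CLIENT_USER_AGENT`, which is parsed with `int(...)` and compared to `1`.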
--- a/app/routes.py
+++ b/app/routes.py
@ -557,6 +557,15 @@ def window():
    )


+@app.route('/robots.txt')
+def robots():
+    response = make_response(
+'''User-Agent: *
+Disallow: /''', 200)
+    response.mimetype = 'text/plain'
+    return response
+
+
 @app.errorhandler(404)
 def page_not_found(e):
    return render_template('error.html', error_message=str(e)), 404
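
To illustrate what the new `/robots.txt` body asks of crawlers, the sketch below feeds the same two lines to the standard library's `urllib.robotparser`. The `allows` helper and the example URL are assumptions for demonstration, not part of the commit.

```python
from urllib.robotparser import RobotFileParser


def allows(robots_txt: str, url: str, agent: str = '*') -> bool:
    """Report whether `agent` may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The body served by the new route: every path is disallowed for every agent.
BLOCK_ALL = 'User-Agent: *\nDisallow: /'
# For contrast: an empty Disallow value permits everything.
ALLOW_ALL = 'User-Agent: *\nDisallow:'

print(allows(BLOCK_ALL, 'https://example.com/search?q=test'))
print(allows(ALLOW_ALL, 'https://example.com/search?q=test'))
```

Compliance is voluntary on the crawler's side, so this discourages indexing of an instance rather than enforcing it.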
--- a/app/static/settings/translations.json
+++ b/app/static/settings/translations.json
@ -1056,12 +1056,12 @@
        "books": "Pirtûk",
        "anon-view": "Dîtina Nenas",
        "": "--",
-        "qdr:h": "Saet berê",
-        "qdr:d": "24 saetên borî",
+        "qdr:h": "Demjimêra borî",
+        "qdr:d": "24 Demjimêrên borî",
        "qdr:w": "Hefteya borî",
        "qdr:m": "Meha borî",
        "qdr:y": "Sala borî",
-        "config-time-period": "Dem Period"
+        "config-time-period": "Pêşsazkariyên demê"
    },
    "lang_th": {
        "search": "ค้นหา",
--- a/app/utils/misc.py
+++ b/app/utils/misc.py
@ -79,3 +79,10 @@ def get_abs_url(url, page_url):
    elif url.startswith('./'):
        return f'{page_url}{url[2:]}'
    return url
+
+
+def list_to_dict(lst: list) -> dict:
+    if len(lst) < 2:
+        return {}
+    return {lst[i].replace(' ', ''): lst[i+1].replace(' ', '')
+            for i in range(0, len(lst) - 1, 2)}
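
As a rough sketch of how a `WHOOGLE_REDIRECTS` value turns into domain pairs, the snippet below applies the same `re.split(',|:', ...)` pattern that `results.py` uses before calling `list_to_dict`. `parse_redirects` is a hypothetical name, and silently dropping a trailing domain that has no partner is an assumption about the desired behavior.

```python
import re


def parse_redirects(value: str) -> dict:
    """Turn 'a.com:b.com,c.com:d.com' into {'a.com': 'b.com', 'c.com': 'd.com'}."""
    # Split on both separators, then strip any spaces inside each part.
    parts = [p.replace(' ', '') for p in re.split(',|:', value)]
    # Pair up neighbours; stopping at len - 1 skips a dangling last entry
    # instead of indexing past the end of the list.
    return {parts[i]: parts[i + 1] for i in range(0, len(parts) - 1, 2)}


print(parse_redirects('badA.com:goodA.com, badB.com:goodB.com'))
print(parse_redirects(''))
```

Because `:` is one of the separators, a value containing `http://` or a port number would be split in the wrong places, which is consistent with the README note about leaving the scheme out.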
--- a/app/utils/results.py
+++ b/app/utils/results.py
@ -1,5 +1,6 @@
 from app.models.config import Config
 from app.models.endpoint import Endpoint
+from app.utils.misc import list_to_dict
 from bs4 import BeautifulSoup, NavigableString
 import copy
 from flask import current_app
@ -43,6 +44,9 @@ SITE_ALTS = {
    'quora.com': os.getenv('WHOOGLE_ALT_QUORA', 'farside.link/quetre')
 }

+# Include custom site redirects from WHOOGLE_REDIRECTS
+SITE_ALTS.update(list_to_dict(re.split(',|:', os.getenv('WHOOGLE_REDIRECTS', ''))))
+

 def contains_cjko(s: str) -> bool:
    """This function check whether or not a string contains Chinese, Japanese,
--- a/app/utils/search.py
+++ b/app/utils/search.py
@ -144,7 +144,8 @@ class Search:
                      and not g.user_request.mobile)

        get_body = g.user_request.send(query=full_query,
-                                       force_mobile=view_image)
+                                       force_mobile=view_image,
+                                       user_agent=self.user_agent)

        # Produce cleanable html soup from response
        get_body_safed = get_body.text.replace("&lt;","andlt;").replace("&gt;","andgt;")
--- a/misc/instances.txt
+++ b/misc/instances.txt
@ -4,7 +4,6 @@ https://search.dr460nf1r3.org
 https://s.tokhmi.xyz
 https://search.sethforprivacy.com
 https://whoogle.dcs0.hu
-https://whoogle.esmailelbob.xyz
 https://whoogle.lunar.icu
 https://gowogle.voring.me
 https://whoogle.privacydev.net
@ -17,3 +16,5 @@ https://whoogle3.ungovernable.men
 https://wgl.frail.duckdns.org
 https://whoogle.no-logs.com
 https://search.rubberverse.xyz
+https://whoogle.ftw.lol
+https://whoogle-search--replitcomreside.repl.co
--- a/requirements.txt
+++ b/requirements.txt
@ -7,10 +7,10 @@ cffi==1.15.1
 chardet==5.1.0
 click==8.1.3
 cryptography==3.3.2; platform_machine == 'armv7l'
-cryptography==39.0.1; platform_machine != 'armv7l'
+cryptography==41.0.0; platform_machine != 'armv7l'
 cssutils==2.6.0
 defusedxml==0.7.1
-Flask==2.2.3
+Flask==2.3.2
 idna==3.4
 itsdangerous==2.1.2
 Jinja2==3.1.2
@ -21,16 +21,16 @@ pluggy==1.0.0
 pycodestyle==2.10.0
 pycparser==2.21
 pyOpenSSL==19.1.0; platform_machine == 'armv7l'
-pyOpenSSL==23.0.0; platform_machine != 'armv7l'
+pyOpenSSL==23.2.0; platform_machine != 'armv7l'
 pyparsing==3.0.9
 PySocks==1.7.1
 pytest==7.2.1
 python-dateutil==2.8.2
-requests==2.28.2
+requests==2.31.0
 soupsieve==2.4
 stem==1.8.1
 urllib3==1.26.14
 waitress==2.1.2
 wcwidth==0.2.6
-Werkzeug==2.2.3
+Werkzeug==2.3.3
 python-dotenv==0.21.1