whoogle-search/app/request.py

from app import rhyme
from io import BytesIO
import pycurl
import urllib.parse as urlparse

# Base search url
SEARCH_URL = 'https://www.google.com/search?gbv=1&q='

MOBILE_UA = '{}/5.0 (Android 0; Mobile; rv:54.0) Gecko/54.0 {}/59.0'
DESKTOP_UA = '{}/5.0 (X11; {} x86_64; rv:75.0) Gecko/20100101 {}/75.0'

# Valid query params
VALID_PARAMS = ['tbs', 'tbm', 'start', 'near']


def gen_user_agent(normal_ua):
    is_mobile = 'Android' in normal_ua or 'iPhone' in normal_ua

    mozilla = rhyme.get_rhyme('Mo') + rhyme.get_rhyme('zilla')
    firefox = rhyme.get_rhyme('Fire') + rhyme.get_rhyme('fox')
    linux = rhyme.get_rhyme('Lin') + 'ux'

    if is_mobile:
        return MOBILE_UA.format(mozilla, firefox)
    else:
        return DESKTOP_UA.format(mozilla, linux, firefox)


def gen_query(query, args, near_city=None):
    param_dict = {key: '' for key in VALID_PARAMS}
    # Use :past(hour/day/week/month/year) if available
    # example search "new restaurants :past month"
    if ':past' in query:
        time_range = query.split(':past', 1)[-1].strip()
        # Skip the filter when nothing follows ':past' (avoids an IndexError)
        if time_range:
            param_dict['tbs'] = '&tbs=qdr:' + time_range[0].lower()

    # Ensure search query is parsable
    query = urlparse.quote(query)

    # Pass along type of results (news, images, books, etc)
    if 'tbm' in args:
        param_dict['tbm'] = '&tbm=' + args.get('tbm')

    # Get results page start value (10 per page, ie page 2 start val = 10)
    if 'start' in args:
        param_dict['start'] = '&start=' + args.get('start')

    # Search for results near a particular city, if available
    if near_city is not None:
        param_dict['near'] = '&near=' + urlparse.quote(near_city)

    # Append every param that was set; unset params are empty strings
    query += ''.join(param_dict.values())

    return query


class Request:
    def __init__(self, normal_ua):
        self.modified_user_agent = gen_user_agent(normal_ua)

    def __getitem__(self, name):
        return getattr(self, name)

    def send(self, base_url=SEARCH_URL, query='', return_bytes=False):
        # Raw response header lines, collected by curl as bytes
        response_header = []

        b_obj = BytesIO()
        crl = pycurl.Curl()
        try:
            crl.setopt(crl.URL, base_url + query)
            crl.setopt(crl.USERAGENT, self.modified_user_agent)
            crl.setopt(crl.WRITEDATA, b_obj)
            crl.setopt(crl.HEADERFUNCTION, response_header.append)
            crl.setopt(pycurl.FOLLOWLOCATION, 1)
            crl.perform()
        finally:
            # Release the curl handle even if the request fails
            crl.close()

        if return_bytes:
            return b_obj.getvalue()
        else:
            return b_obj.getvalue().decode('unicode-escape', 'ignore')

Commit history for this file (oldest first):

2020-04-24 02:59:43 +00:00
Refactoring of user requests and routing
Curl requests and user agent related functionality was moved to its own request class. Routes was refactored to only include strictly routing related functionality. Filter class was cleaned up (had routing/request related logic in here, which didn't make sense)
Lines: imports, SEARCH_URL, MOBILE_UA, DESKTOP_UA, gen_user_agent, the comments and the tbm/start/near checks in gen_query, the Request class and the body of send.

2020-04-28 02:21:36 +00:00
Added image proxying, refactored filter class
Images were previously directly fetched from google search results, which was a potential privacy hazard. All image sources are now modified to be passed through shoogle's routing first, which will then fetch raw image data and pass it through to the user. Filter class was refactored to split the primary clean method into smaller, more manageable submethods.
Lines: the send signature (return_bytes) and the return_bytes branch.

2020-04-29 00:19:34 +00:00
Added POST search, encrypted query strings, refactoring
The implementation of POST search support comes with a few benefits. The most apparent is the avoidance of search queries appearing in web server logs -- instead of the prior GET approach (i.e. /search?q=my+search+query), using POST requests with the query stored in the request body creates logs that simply appear as "/search". Since a lot of relative links are generated in the results page, I came up with a way to generate a unique key at run time that is used to encrypt any query strings before sending to the user. This benefits both regular text queries as well as fetching of image links and means that web logs will only show an encrypted string where a link or query string might slip through. Unfortunately, GET search requests still need to be supported, as it doesn't seem that Firefox (on iOS) supports loading search engines by their opensearch.xml file, but instead relies on manual entry of a search query string. Once this is updated, I'll probably remove GET request search support.
Lines: the "# Valid query params" comment.

2020-04-29 00:59:33 +00:00
Updated tests, fixed a few bugs
Added opensearch routes test and individual tests for searching via GET and POST separately. Fixed incorrect assignment in gen_query.
Lines: the gen_query signature, the ':past' check and time_range assignment, the urlparse.quote call, and the end of the parameter loop through "return query".

2020-04-30 00:53:58 +00:00
Restructured valid params checking, added empty query redirect
Lines: VALID_PARAMS, the param_dict initialisation and assignments, and the start of the parameter loop.

2020-05-04 01:32:47 +00:00
Updated formatting and setup instructions
Switched encoding from utf-8 to unicode-escape in an effort to support multiple languages besides English. Updated image results page formatting to fix bad image links (added TODO for adding full res image link for each image result). Updated README to include libcurl and libssl install instructions for manual setup.
Lines: the final decode('unicode-escape', 'ignore') return in send.
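
The query-building step in gen_query can be tried in isolation, without pycurl or the app package. The following is a minimal sketch, not the project's code: `build_query` is a hypothetical standalone name that mirrors gen_query from the listing above, `args` is assumed to behave like a plain dict of strings, and the check for an empty time range is an addition made for this sketch.

```python
import urllib.parse as urlparse

# Same parameter order as VALID_PARAMS in the listing above
VALID_PARAMS = ['tbs', 'tbm', 'start', 'near']


def build_query(query, args, near_city=None):
    """Standalone sketch of gen_query (hypothetical name, for illustration)."""
    params = {key: '' for key in VALID_PARAMS}

    # ':past month' -> '&tbs=qdr:m' (first letter of the time range)
    if ':past' in query:
        time_range = query.split(':past', 1)[-1].strip()
        if time_range:  # assumption: ignore a bare ':past' with no range
            params['tbs'] = '&tbs=qdr:' + time_range[0].lower()

    # Percent-encode the search terms so they are safe inside a URL
    query = urlparse.quote(query)

    if 'tbm' in args:
        params['tbm'] = '&tbm=' + args['tbm']
    if 'start' in args:
        params['start'] = '&start=' + args['start']
    if near_city is not None:
        params['near'] = '&near=' + urlparse.quote(near_city)

    # Unset params are empty strings, so joining them adds nothing
    return query + ''.join(params.values())


plain = build_query('weather', {})
filtered = build_query(
    'new restaurants :past month',
    {'tbm': 'nws', 'start': '10'},
    near_city='San Francisco',
)
print(plain)
print(filtered)
```

The returned string is meant to be appended to SEARCH_URL, which already ends in `&q=`. Parameters appear in the order given by VALID_PARAMS, regardless of the order in which they were set.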