whoogle-search/test/test_results.py

from bs4 import BeautifulSoup
from app.filter import Filter
from app.utils.session import generate_user_key
from datetime import datetime
from dateutil.parser import ParserError, parse
from urllib.parse import urlparse

from test.conftest import demo_config


def get_search_results(data):
    """Return the result divs found in a filtered search response body."""
    secret_key = generate_user_key()
    soup = Filter(user_key=secret_key).clean(
        BeautifulSoup(data, 'html.parser'))

    # find() returns a single Tag (or None); len() counts its direct children
    main_div = soup.find('div', {'id': 'main'})
    assert main_div is not None
    assert len(main_div) > 1

    result_divs = []
    for div in main_div:
        # Result divs should only have 1 inner div
        if (len(list(div.children)) != 1
                or not div.findChild()
                or 'div' not in div.findChild().name):
            continue

        result_divs.append(div)

    return result_divs


def test_get_results(client):
    rv = client.get('/search?q=test')
    assert rv.status_code == 200

    # Depending on the search, there can be more
    # than 10 result divs
    results = get_search_results(rv.data)
    assert len(results) >= 10
    assert len(results) <= 15


def test_post_results(client):
    rv = client.post('/search', data=dict(q='test'))
    assert rv.status_code == 200

    # Depending on the search, there can be more
    # than 10 result divs
    results = get_search_results(rv.data)
    assert len(results) >= 10
    assert len(results) <= 15


def test_translate_search(client):
    rv = client.post('/search', data=dict(q='translate hola'))
    assert rv.status_code == 200

    # Pretty weak test, but better than nothing
    str_data = str(rv.data)
    assert 'iframe' in str_data
    assert 'lingva.ml/auto/en/ hola' in str_data


def test_block_results(client):
    rv = client.post('/search', data=dict(q='pinterest'))
    assert rv.status_code == 200

    has_pinterest = False
    for link in BeautifulSoup(rv.data, 'html.parser').find_all('a', href=True):
        if 'pinterest.com' in urlparse(link['href']).netloc:
            has_pinterest = True
            break

    assert has_pinterest

    demo_config['block'] = 'pinterest.com'
    rv = client.post('/config', data=demo_config)
    assert rv.status_code == 302

    rv = client.post('/search', data=dict(q='pinterest'))
    assert rv.status_code == 200

    for link in BeautifulSoup(rv.data, 'html.parser').find_all('a', href=True):
        assert 'pinterest.com' not in urlparse(link['href']).netloc


# TODO: Unit test the site alt method instead -- the results returned
# are too unreliable for this test in particular.
# def test_site_alts(client):
    # rv = client.post('/search', data=dict(q='twitter official account'))
    # assert b'twitter.com/Twitter' in rv.data

    # client.post('/config', data=dict(alts=True))
    # assert json.loads(client.get('/config').data)['alts']

    # rv = client.post('/search', data=dict(q='twitter official account'))
    # assert b'twitter.com/Twitter' not in rv.data
    # assert b'nitter.net/Twitter' in rv.data


def test_recent_results(client):
    times = {
        'past year': 365,
        'past month': 31,
        'past week': 7
    }

    for time, num_days in times.items():
        rv = client.post('/search', data=dict(q='test :' + time))
        result_divs = get_search_results(rv.data)

        current_date = datetime.now()
        for div in result_divs:
            # Only results with a child span can carry a date
            span = div.find('span')
            if span is None:
                continue

            date_span = span.decode_contents()
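            # Date spans look like "mm dd yyyy" or "xx days/months/weeks
            # ago"; anything under 7 or over 15 characters is not a date
            if not date_span or len(date_span) > 15 or len(date_span) < 7: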
                continue

            try:
                date = parse(date_span)
                # Date can have a little bit of wiggle room
                assert (current_date - date).days <= (num_days + 5)
            except ParserError:
                # Spans that are not absolute dates fail to parse; skip them
                pass