162 points by tvvocold a year ago
Even with a pool of proxies, I would expect an instance of this "metasearch engine" to quickly get banned by the other search engines. The same IP running thousands of queries and scraping their content (which is against their ToS) should be easily detectable.
I've been running multiple SearX instances for goodness knows how long and this has never happened to me. I'm not aware of this happening either.
SearX uses multiple sources for queries, so you'd have to be banned by quite a few search engines to stop it being useful.
Also relevant is filtron, an application firewall for SearX that rate-limits searches.
 - https://asciimoo.github.io/searx/admin/filtron.html
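For reference, filtron rules are plain JSON. A sketch of a rate-limiting rule (the interval/limit values here are just illustrative, not recommended settings):

```json
[
  {
    "name": "search request",
    "filters": ["Param:q", "Path=^(/|/search)$"],
    "interval": 60,
    "limit": 10,
    "actions": [
      {"name": "log"},
      {"name": "block", "params": {"message": "Rate limit exceeded"}}
    ]
  }
]
```

This would block a client making more than 10 searches per 60 seconds.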
Building on that issue, I'd like to add that it would be nice to have a feature that alerts the user when a certain search engine is denying requests. It's visible in the logs or settings somewhere, but I usually find myself wondering for a while why my search results are off before heading there to figure it out.
Still a great project though, I use it every day.
At least for me, next to each result is a list of the engines that returned that result. I run searx through Tor, so I occasionally find that Google stops returning results for a few minutes.
It doesn't happen often, but it's easy to tell when it does because none of the first page results have "google" next to them, while of course normally most of them would.
I'm curious. How does DuckDuckGo do this?
By paying for API access to other search engines. Yahoo used to offer that publicly but eventually shut the service down (but kept DDG as a legacy customer)
That's interesting. I wonder what the cost is.
Approximately $1 per 1000 requests https://www.programmableweb.com/news/yahoos-new-search-api-p...
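At that rate, a quick back-of-envelope estimate (the per-day figures below are made up for illustration):

```python
# Quoted rate: roughly $1 per 1,000 API requests.
RATE_PER_REQUEST = 1.00 / 1000  # USD

def monthly_cost(searches_per_day: float, days: int = 30) -> float:
    """Estimated monthly spend for a single user at this rate."""
    return searches_per_day * days * RATE_PER_REQUEST

# Even a heavy individual user (~100 searches/day) costs on the order of $3/month.
print(f"${monthly_cost(100):.2f}")
```

So per-user cost is tiny; it's the aggregate volume of a public service that adds up.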
Thanks for the link!
This is self-hosted, so presumably each person hosts their own instance and uses that. The number of queries coming from a single instance in that case wouldn't look out of the ordinary.
Then that defeats the purpose of trying to be privacy focused if your search queries aren't mixed with other people's queries.
I also run searx self-hosted, configured to proxy all its queries through Tor. Occasionally one of the engines doesn't return results (probably due to blocking), which is barely noticeable since several others still work, but normally all the engines including Google return results.
Since searx doesn't store cookies returned by the search engines, and I'm using it through Tor, I think this is a significant improvement over sending all my search queries to Google directly from my laptop.
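For anyone curious, pointing searx's outgoing queries at Tor is just a proxy setting in settings.yml. A sketch, assuming Tor is running locally on its default SOCKS port 9050 (the socks5h scheme makes DNS resolve through Tor as well):

```yaml
outgoing:
  proxies:
    http: socks5h://127.0.0.1:9050
    https: socks5h://127.0.0.1:9050
```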
I can't grasp how you got that idea. Do you not know what self-hosted means?
the engine crawls the web and saves its data locally.
this locally saved data can then be queried/searched.
So yes, in your search engine, there will only be your own searches. But these searches are only visible to your own servers/services.
> I can't grasp how you got that idea.
By reading the link?
> the engine craws the web and saves its data locally
...Saves an index of the whole web, locally?
This is not what SearX does. It queries other search engines.
yes, i mixed up the engines and commented without verifying which it was.
i'm sorry for that. i just thought it wasn't necessary to edit, as somebody had sufficiently pointed out how mistaken i was 9 days ago.
This is not how searx works.
google will give you a captcha every once in a while but they never actually stop you from using their service.
It will also sometimes ban you completely (not even the CAPTCHA works, solving it just gets you another) for ~2h. I've triggered it manually, usually when trying very specific queries and multiple variations in quick succession and also going through to the "end" of the result pages.
getting banned in that case must've been extremely aggravating.
I wouldn't worry about it. If they get banned, they will probably apply some ML technique to circumvent the CAPTCHAs and get access again. Also, this can run from the user's computer, so it would actually be quite hard to detect that the results are being aggregated and stripped of ads.
On a related note, have any of you tried FindX?
Looks promising but haven't used it much yet.
I like this part "it draws its results from its own bot that crawls the web". There aren't too many that use their own bot.
Tsignal (https://deepsearch.tsignal.io/) is another that uses its own bot, with a little AI tossed into the mix. And... it's currently not accessible :(
Wanted to add that I'm currently on Opera and using an extension named Search All. After you search via your default search engine, this extension places a bar with a list of user-configurable search engines, which lets you easily run the same keywords through the alternatives.
One great feature is that if you go directly to a search engine and click on the Search All icon, it almost always identifies the engine with the correct parameters so it can easily be added. Just added findX to my bar (for testing).
I plan on going back to Firefox and wish FF had something like this (part of the reason I'm posting this).
There is also http://yacy.net, a free, peer-to-peer, distributed search engine.
I wonder what the rationale is behind listing each site's CA in the public instance list (https://github.com/asciimoo/searx/wiki/Searx-instances).
Does this have any advantages over StartPage?
Or DuckDuckGo, for that matter?
* Can be self-hosted.
* Queries other search engines besides just Google.
* Has more search options.
This is awesome. I have been wanting to build something like this for a while but never had the time.
There's also pears search, which died sometime after being funded by Mozilla. I don't think there was malicious intent, but no one can explain why the project is inactive.