162 points by tvvocold a year ago
Even with a pool of proxies, I would expect an instance of this "metasearch engine" to quickly get banned by the other search engines. The same IP running thousands of queries and scraping their content (which is against their ToS) should be easily detectable.
I've been running multiple SearX instances for goodness knows how long and this has never happened to me. I'm not aware of this happening either.
SearX uses multiple sources for queries, so you'd have to be banned by quite a few search engines to stop it being useful.
Also relevant is filtron, an application firewall for SearX that rate-limits searches.
 - https://asciimoo.github.io/searx/admin/filtron.html
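For reference, filtron rules are plain JSON. A sketch of a rate-limiting rule (the interval/limit values here are just illustrative, not recommended settings):

```json
[
  {
    "name": "search request",
    "filters": ["Param:q", "Path=^(/|/search)$"],
    "interval": 60,
    "limit": 10,
    "actions": [
      {"name": "log"},
      {"name": "block", "params": {"message": "Rate limit exceeded"}}
    ]
  }
]
```

This would block a client making more than 10 searches per 60 seconds.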
Building on that issue, I'd like to add that it would be nice to have a feature that alerts the user when a certain search engine is denying requests. It's visible in the logs or settings somewhere, but I usually find myself wondering for a while why my search results are off before heading there to figure it out.
Still a great project though, I use it every day.
At least for me, next to each result is a list of the engines that returned that result. I run searx through Tor, so I occasionally find that Google stops returning results for a few minutes.
It doesn't happen often, but it's easy to tell when it does because none of the first page results have "google" next to them, while of course normally most of them would.
I'm curious. How does DuckDuckGo do this?
By paying for API access to other search engines. Yahoo used to offer that publicly but eventually shut the service down (but kept DDG as a legacy customer)
That's interesting. I wonder what the cost is.
Approximately $1 per 1000 requests https://www.programmableweb.com/news/yahoos-new-search-api-p...
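At that rate, a quick back-of-envelope estimate (the per-day figures below are made up for illustration):

```python
# Quoted rate: roughly $1 per 1,000 API requests.
RATE_PER_REQUEST = 1.00 / 1000  # USD

def monthly_cost(searches_per_day: float, days: int = 30) -> float:
    """Estimated monthly spend for a single user at this rate."""
    return searches_per_day * days * RATE_PER_REQUEST

# Even a heavy individual user (~100 searches/day) costs on the order of $3/month.
print(f"${monthly_cost(100):.2f}")
```

So per-user cost is tiny; it's the aggregate volume of a public service that adds up.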
Thanks for the link!
This is self-hosted, so presumably each person hosts their own instance and uses that. The number of queries coming from a single instance in that case wouldn't look out of the ordinary.
Then that defeats the purpose of trying to be privacy focused if your search queries aren't mixed with other people's queries.
I also run searx self-hosted, configured to proxy all its queries through Tor. Occasionally one of the engines doesn't return results (probably due to blocking), which is barely noticeable since several others still work, but normally all the engines including Google return results.
Since searx doesn't store cookies returned by the search engines, and I'm using it through Tor, I think this is a significant improvement over sending all my search queries to Google directly from my laptop.
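For anyone curious, pointing searx's outgoing queries at Tor is just a proxy setting in settings.yml. A sketch, assuming Tor is running locally on its default SOCKS port 9050 (the socks5h scheme makes DNS resolve through Tor as well):

```yaml
outgoing:
  proxies:
    http: socks5h://127.0.0.1:9050
    https: socks5h://127.0.0.1:9050
```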
I can't grasp how you got that idea. Do you not know what self-hosted means?
the engine crawls the web and saves its data locally.
this locally saved data can then be queried/searched.
So yes, in your search engine, there will only be your own searches. But these searches are only visible to your own servers/services.
> I can't grasp how you got that idea.
By reading the link?
> the engine craws the web and saves its data locally
...Saves an index of the whole web, locally?
This is not what SearX does. It queries other search engines.
yes, i mixed up the engines and commented without verifying which it was.
i'm sorry for that. i just thought it wasn't necessary to edit, as somebody had sufficiently pointed out how mistaken i was 9 days ago.
This is not how searx works.
google will give you a captcha every once in a while but they never actually stop you from using their service.
It will also sometimes ban you completely (not even the CAPTCHA works, solving it just gets you another) for ~2h. I've triggered it manually, usually when trying very specific queries and multiple variations in quick succession and also going through to the "end" of the result pages.
getting banned in that case must've been extremely aggravating.
I wouldn't worry about it. If they get banned, they will probably apply some ML technique to circumvent the CAPTCHAs and get access again. Also, this can run from the user's computer, so it would actually be quite hard to detect that the results are being aggregated and stripped of ads.
On a related note, have any of you tried FindX?
Looks promising but haven't used it much yet.
I like this part "it draws its results from its own bot that crawls the web". There aren't too many that use their own bot.
Tsignal (https://deepsearch.tsignal.io/) is another that uses its own bot, with a little AI tossed into the mix. And... it's currently not accessible :(
Wanted to add that I'm currently on Opera and using an extension named Search All. After you search via your default search engine, this extension places a bar with a list of user-configurable search engines, which lets you easily run the same keywords through the alternatives.
One great feature is that if you go directly to a search engine and click on the Search All icon, it almost always identifies the engine with the correct parameters so it can easily be added. Just added findX to my bar (for testing).
I plan on going back to Firefox and wish FF had something like this (part of the reason I'm posting this).
There is also http://yacy.net, a free, peer-to-peer, distributed search engine.
I wonder what the rationale is behind listing each site's CA in the public instance list (https://github.com/asciimoo/searx/wiki/Searx-instances).
Does this have any advantages over StartPage?
Or DuckDuckGo, for that matter?
* Can be self-hosted.
* Queries other search engines besides just Google.
* Has more search options.
This is awesome. I have been wanting to build something like this for a while but never had the time.
There's also pears search, which died sometime after being funded by Mozilla. I don't think there was malicious intent, but no one can explain why the project is inactive.