Ask HN: Where can I find the latest info for essay ranking, spam filtering?
I used to do a lot of work with spam filtering. I once worked at a company that had set up hundreds of marketing websites, all of which were the start of a sales funnel that fed into Marketo, then into LeadSpace, and then into Salesforce. A good response from potential customers looked like this:
"I am interested in the pricing for MaxaMegaAI. Do you have a free tier for a startup with less than 10 developers?"
or:
"Can your ETL tool handle different systems for geospatial calculations?"
Bad responses looked like:
"None"
or:
"Damn"
Or:
"sdefedflkjlkjsdfsdlkfjlskdfj"
I wrote simple machine learning scripts to automate some of our spam filtering.
I have the impression this has come a long way?
I think this category of machine learning is sometimes called "essay ranking."
I've been away from this kind of work for 7 years. I assume nowadays, with LLMs, there might be some advanced techniques that can be easily implemented?
Can someone point me towards a good resource?
I process text through
https://www.sbert.net/
and apply a classical machine learning algorithm such as a probability-calibrated SVM. This usually beats bag-of-words classifiers because it is able to suss out some of the meaning of words. The advantage of this approach is that it is very fast (maybe 30 seconds to reliably train a model).
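Roughly, the embed-then-classify approach looks like the sketch below. It is not my exact code: the model name "all-MiniLM-L6-v2", the helper names, and the choice of LinearSVC wrapped in scikit-learn's CalibratedClassifierCV are assumptions for illustration, and label 1 is taken to mean spam.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC


def sbert_embedder(model_name="all-MiniLM-L6-v2"):
    """Return a function that maps a list of strings to an embedding array.

    The model name is an assumption; any sentence-transformers model works.
    """
    # Imported lazily so the rest of the sketch runs with any embedder.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(list(texts))


def train_filter(texts, labels, embed, cv=3):
    """Fit a probability-calibrated linear SVM on top of text embeddings."""
    X = np.asarray(embed(texts))
    # Needs at least `cv` examples of each class for the internal folds.
    clf = CalibratedClassifierCV(LinearSVC(), cv=cv)
    clf.fit(X, labels)
    return clf


def spam_probability(clf, texts, embed):
    """Calibrated probability that each text is spam (label 1)."""
    X = np.asarray(embed(texts))
    spam_col = list(clf.classes_).index(1)
    return clf.predict_proba(X)[:, spam_col]
```

Usage would be something like clf = train_filter(texts, labels, sbert_embedder()) followed by spam_probability(clf, new_texts, sbert_embedder()). In practice you would build the embedder once and reuse it, and you would want at least a few hundred labeled responses before trusting the probabilities.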
It is also possible to “fine-tune” a BERT-family model using tools from Hugging Face, like so:
https://huggingface.co/docs/transformers/training
My experience is that this takes more like 30 minutes to train a model, and the process is not as reliable. For some tasks this performs better than the first approach, but I haven’t gotten it to reliably improve on my current models for my tasks.
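For reference, a fine-tuning run along the lines of that tutorial can be sketched as below. Treat it as a starting point under assumptions, not a tuned recipe: the model name "distilbert-base-uncased", the output directory, and the hyperparameters (3 epochs, batch size 16, 128 tokens) are placeholders I picked for illustration, and the helper names are hypothetical.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Accuracy from the (logits, labels) pair that Trainer passes in."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == np.asarray(labels)).mean())}


def fine_tune(train_texts, train_labels, eval_texts, eval_labels,
              model_name="distilbert-base-uncased", output_dir="spam-model"):
    """Fine-tune a BERT-family classifier on labeled texts (0 = good, 1 = spam)."""
    # Heavy imports stay local; they need `transformers`, `datasets`, and torch.
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize(batch):
        # Fixed-length padding keeps the default data collator happy.
        return tokenizer(batch["text"], padding="max_length",
                         truncation=True, max_length=128)

    def to_dataset(texts, labels):
        ds = Dataset.from_dict({"text": list(texts), "label": list(labels)})
        return ds.map(tokenize, batched=True)

    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2)
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=3,
                             per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args,
                      train_dataset=to_dataset(train_texts, train_labels),
                      eval_dataset=to_dataset(eval_texts, eval_labels),
                      compute_metrics=compute_metrics)
    trainer.train()
    return trainer
```

After training, trainer.evaluate() reports the accuracy on the held-out set, which is the number to compare against the embedding-plus-SVM baseline. Results vary from run to run with the random seed, which is part of why I call the process less reliable.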
I am planning on fine-tuning a T5 model when I have a problem that I think it will do well on.
Thank you, this is great.