Ask HN: Where can I find the latest info for essay ranking, spam filtering?

2 points by lkrubner 12 days ago

I used to do a lot of work with spam filtering. I once worked at a company that had set up hundreds of marketing websites, all of which were the start of a sales funnel that fed into Marketo, then into LeadSpace, and then into Salesforce. A good response from potential customers looked like this:

"I am interested in the pricing for MaxaMegaAI. Do you have a free tier for a startup with less than 10 developers?"

or:

"Can your ETL tool handle different systems for geospatial calculations?"

Bad responses looked like:

"None"

or:

"Damn"

Or:

"sdefedflkjlkjsdfsdlkfjlskdfj"

I wrote simple machine learning scripts to automate some of our spam filtering.

I have the impression that this field has come a long way since then?

I think this category of machine learning is sometimes called "essay ranking."

I've been away from this kind of work for 7 years. I assume nowadays, with LLMs, there might be some advanced techniques that can be easily implemented?

Can someone point me towards a good resource?

PaulHoule 12 days ago

I process text through

https://www.sbert.net/

and apply a classical machine learning algorithm such as a probability-calibrated SVM. This usually beats bag-of-words classifiers because it can pick up some of the meaning of words. The advantage of this approach is that it is very fast (maybe 30 seconds to reliably train a model).
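
A minimal sketch of the embed-then-classify approach described above, not necessarily the exact setup used here. The model name "all-MiniLM-L6-v2", the helper names (sbert_embedder, train_filter, prob_good, demo), and the extra training strings are assumptions for illustration. The embedding function is passed in as a parameter, so any function that maps a list of strings to a 2-D array can be swapped in.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# The first two GOOD and first three BAD strings are quoted from the question
# above; the remaining ones are made-up placeholders (assumption).
GOOD = [
    "I am interested in the pricing for MaxaMegaAI. Do you have a free tier "
    "for a startup with less than 10 developers?",
    "Can your ETL tool handle different systems for geospatial calculations?",
    "Does the enterprise plan include single sign-on and audit logs?",
    "We need to sync about two million rows nightly. Is that supported?",
]
BAD = ["None", "Damn", "sdefedflkjlkjsdfsdlkfjlskdfj", "asdf"]


def sbert_embedder(model_name="all-MiniLM-L6-v2"):
    """Return a function that maps a list of strings to an embedding matrix."""
    # Imported lazily so the rest of the sketch works without this package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(list(texts))


def train_filter(embed, good, bad, cv=2):
    """Fit a probability-calibrated linear SVM on embedded responses."""
    texts = list(good) + list(bad)
    labels = [1] * len(good) + [0] * len(bad)
    # cv=2 only because the toy data set is tiny; use more folds on real data.
    clf = CalibratedClassifierCV(LinearSVC(random_state=0), cv=cv)
    clf.fit(embed(texts), labels)
    return clf


def prob_good(clf, embed, texts):
    """Return the calibrated probability that each response is a good lead."""
    col = list(clf.classes_).index(1)  # column that corresponds to label 1
    return clf.predict_proba(embed(texts))[:, col]


def demo():
    """Downloads the SBERT model on first use, so it is not run on import."""
    embed = sbert_embedder()
    clf = train_filter(embed, GOOD, BAD)
    print(prob_good(clf, embed, ["What does the team plan cost?", "lol"]))
```

Calling demo() trains on the toy data and prints one probability per input. With real data you would hold out a test set and pick a probability threshold that matches how many bad leads you are willing to let through.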

It is also possible to “fine tune” a BERT family model using tools from Huggingface like so

https://huggingface.co/docs/transformers/training

In my experience, this takes more like 30 minutes to train a model, and the process is not as reliable. For some tasks this performs better than the first approach, but I haven’t gotten it to reliably improve on my current models for my tasks.
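
For comparison, a rough sketch of what fine-tuning a BERT-family classifier with the Hugging Face Trainer can look like. The checkpoint "distilbert-base-uncased", the output directory, the hyperparameters, and the helper names (build_examples, fine_tune) are assumptions chosen for illustration, not settings taken from this thread.

```python
def build_examples(good, bad):
    """Pack labeled responses into the column layout Dataset.from_dict accepts."""
    return {
        "text": list(good) + list(bad),
        "label": [1] * len(good) + [0] * len(bad),
    }


def fine_tune(examples, model_name="distilbert-base-uncased",
              output_dir="lead-filter"):
    """Fine-tune a two-class sequence classifier on the given examples."""
    # Imported lazily: these packages are large and pull in PyTorch.
    from datasets import Dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize(batch):
        # Pad to a fixed length so the default collator can stack the batch.
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=64)

    dataset = Dataset.from_dict(examples).map(tokenize, batched=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2
    )
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,              # assumed value, tune for your data
        per_device_train_batch_size=8,   # assumed value
    )
    trainer = Trainer(model=model, args=args, train_dataset=dataset)
    trainer.train()
    return trainer
```

Something like fine_tune(build_examples(good_responses, bad_responses)) would start a training run, where good_responses and bad_responses are your own labeled lists. Results can vary from run to run, so it is worth comparing against the embedding-plus-SVM baseline on a held-out set.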

I am planning on fine-tuning a T5 model when I have a problem that I think it will do well on.

lkrubner 12 days ago

Thank you, this is great.