diegolo 5 years ago

"a general search can be machine learning" I don't get this sentence: Machine learning is about building a mathematical model of sample data, known as "training data".

If you want to talk about machine learning and search you should probably talk about learning to rank (https://en.m.wikipedia.org/wiki/Learning_to_rank)

  • snotrockets 5 years ago

    I'd argue that you're too restrictive in your definition. e.g. unsupervised clustering has no sample training data.

    The usual definition (due to Mitchell) is that machine learning is a system s.t. its performance on a given task improves with experience.

    • thegginthesky 5 years ago

      Actually, any unsupervised method, including clustering, still has training data. The only difference is that there's no target y variable in the training set to minimize an error metric against, hence the name unsupervised.

      But the definition you mention is right. Still, any dataset you use to fit your model is your training set, even if you don't have a train/test split or the like, because you trained your model on it.
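      To make that concrete, here's a minimal k-means sketch in plain numpy (a toy illustration with made-up blob data, not any particular library's implementation): X carries no labels, yet it is still the training set, because the centroids are fit to it.

```python
import numpy as np

# Two well-separated blobs of unlabeled points: no y variable, but X is
# still the "training set", because the centroids are fit to it.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),   # blob around (0, 0)
               rng.normal(5.0, 0.1, (20, 2))])  # blob around (5, 5)

k = 2
centroids = X[[0, -1]].copy()  # deterministic init: one point from each blob
for _ in range(10):
    # assignment step: each point goes to its nearest centroid
    labels = ((X[:, None, :] - centroids) ** 2).sum(-1).argmin(1)
    # update step: each centroid moves to the mean of its assigned points
    centroids = np.array([X[labels == j].mean(0) for j in range(k)])

print(np.sort(centroids[:, 0]).round(1))  # centroids land near the blob centres
```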

      • snotrockets 5 years ago

        K-means has no "training data" per se.

inertiatic 5 years ago

Search is now machine learning? Interesting introduction to the topic otherwise.

  • softwaredoug 5 years ago

    I would say this isn't machine learning, but relevance in general is an interesting area in which to apply supervised learning. Of course, the training data is the hard part.

    An article on the topic: https://opensourceconnections.com/blog/2017/08/03/search-as-... (disclaimer: I wrote it)

    • inertiatic 5 years ago

      I also work in this field, so I do have an idea of what's possible if you apply machine learning techniques to improve relevance rankings.

      But to my intuition, basic search doesn't feel like a machine learning task. After reading some of the responses to my post, however, I'm trying to come up with a meaningful reason why I wouldn't consider IDF to be machine learning, given that it is updated as more documents enter the corpus and the system "learns" to re-rank existing result sets based on these new documents.

  • Cybiote 5 years ago

    These categorizations tend to be arbitrary and inconsistent, because what counts as intelligence is probably subjective. People consider kNN and naive Bayes to be machine learning; one is "just" sorting and the other is just counting. The learning in naive Bayes, and even in some higher-order Bayesian networks, is of a similar flavor to the count structures generated for IR.

    Because we understand how these algorithms work, we can always reduce them to just this or that. Prediction in many linear classifiers is just dot products; ReLU neural networks are just lots of clamped dot products. Random projections on simple count data can generate word embeddings.

    Whether something counts as mere model-fitting, super-scaling, compression, or as AI and machine learning will depend on the field it originated from. It's indisputable that the algorithms are reducible in this way, but I tend to think we should care more about functional capabilities, measured against an appropriate subset of a known intelligence's abilities, than about details of implementation.
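    To spell out the "just dot products" reduction with a toy numpy sketch (illustrative weights and inputs, not from any real model):

```python
import numpy as np

x = np.array([1.0, 2.0, -1.0])   # input features
w = np.array([0.5, -0.25, 1.0])  # learned weights

# linear classifier: the prediction really is just a dot product
logit = w @ x
prob = 1 / (1 + np.exp(-logit))  # logistic link on top

# one ReLU layer: a batch of dot products, clamped at zero
W = np.array([[0.5, -0.25, 1.0],
              [1.0, 1.0, 0.0]])
hidden = np.maximum(W @ x, 0.0)

print(logit)   # -1.0
print(hidden)  # [0. 3.]
```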

  • ma2rten 5 years ago

    In general, search can be machine learning. Google certainly uses machine learning as part of its ranking.

    I guess you can make the argument that even tf-idf as described in the article is a form of unsupervised machine learning because you obtain ("learn") the idf from the data.
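    For instance, a minimal sketch of that "learning" step on a toy corpus (using the plain log(N/df) variant of IDF, no smoothing):

```python
import math
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

# document frequency: how many docs each term appears in
df = Counter(term for doc in docs for term in set(doc.split()))

# the idf weights come entirely from the corpus, which is the sense
# in which they are obtained ("learned") from the data
idf = {term: math.log(len(docs) / df[term]) for term in df}

print(round(idf["the"], 3))  # 0.405 -- common term, low weight
print(round(idf["cat"], 3))  # 1.099 -- rare term, high weight
```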

    • jahewson 5 years ago

      TF-IDF is a feature extracted from the data, much like a simple count of words, but it is not learned; it is simply computed. An example of learned features is word embeddings, where it is necessary to train on data to obtain them.

      If you want to apply machine learning to search then you need clickstream data, embeddings, or learned feature weights.

      • yorwba 5 years ago

        Word embeddings are also "simply computed." If you use GloVe, the vectors are obtained by factoring a matrix of co-occurrence counts.

        The difference between machine learning and "simple" feature extraction is mostly just in the choice of metaphors used to describe the computation, not in any fundamental properties.
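        As an illustration, here's a rank-2 SVD of toy co-occurrence counts (this is the LSA flavour of count factorization rather than GloVe's weighted objective, and the counts are invented):

```python
import numpy as np

# Toy symmetric co-occurrence counts for the vocabulary [cat, dog, pet, vet]:
# cat and dog each co-occur with the context words pet and vet.
C = np.array([[0, 0, 4, 3],
              [0, 0, 3, 4],
              [4, 3, 0, 0],
              [3, 4, 0, 0]], dtype=float)

# truncated SVD of the count matrix: each row of U[:, :2] * s[:2]
# is a 2-d "embedding", simply computed from the counts
U, s, Vt = np.linalg.svd(C)
vectors = U[:, :2] * s[:2]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# cat and dog have near-identical co-occurrence rows, so their vectors
# align; cat and pet never co-occur with the same context words
print(round(cosine(vectors[0], vectors[1]), 2))       # 1.0
print(round(abs(cosine(vectors[0], vectors[2])), 2))  # 0.0
```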

        • ma2rten 5 years ago

          Right. Naive Bayes is considered to be a machine learning algorithm, but also consists of just "simple counting".
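          A minimal sketch of that counting (toy made-up data, add-one smoothing):

```python
import math
from collections import Counter, defaultdict

# "training" a naive Bayes spam filter is nothing but counting
train = [("buy cheap pills now", "spam"), ("cheap pills cheap", "spam"),
         ("meeting notes attached", "ham"), ("see notes from the meeting", "ham")]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for text, _ in train for w in text.split()}

def predict(text):
    def score(label):
        total = sum(word_counts[label].values())
        # log prior + log likelihoods with add-one smoothing
        return math.log(class_counts[label] / len(train)) + sum(
            math.log((word_counts[label][w] + 1) / (total + len(vocab)))
            for w in text.split())
    return max(class_counts, key=score)

print(predict("cheap pills"))    # spam
print(predict("meeting notes"))  # ham
```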

      • drongoking 5 years ago

        Your distinction between TF-IDF as simply computed and embeddings as learned is odd and artificial. Both are computations from data; TF-IDF just has an understandable closed form while word embeddings do not. As for machine learning, it has to do with improvement and doesn't even necessarily need data.

        • jahewson 5 years ago

          I think you’re right. It’s the improvement in the learning process that’s the important bit. TF-IDF lacks that.

  • pilooch 5 years ago

    Search is often formulated as 'learning to rank'.

    • inertiatic 5 years ago

      AFAIK learning to rank refers to things more advanced than simple TF-IDF.

humbleMouse 5 years ago

Reading your site on my phone and it reloads every 10 seconds. Annoying.

rajangdavis 5 years ago

Curious to see how this might compare against Postgres's full-text search.

The text search vector type is pretty much a poor man's bag-of-words model (with stop-word removal and some lemmatization), but instead of counts, you get the positions where the words occur.
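To illustrate the difference (ignoring the stop-word removal and lemmatization a real text search configuration applies), a toy sketch of counts versus positions:

```python
from collections import Counter, defaultdict

doc = "the quick brown fox jumps over the lazy dog"

# bag of words: term -> count
counts = Counter(doc.split())

# tsvector-style view: term -> 1-based positions, as Postgres stores them
positions = defaultdict(list)
for i, term in enumerate(doc.split(), start=1):
    positions[term].append(i)

print(counts["the"])     # 2
print(positions["the"])  # [1, 7]
```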

eggie5 5 years ago

He generated query-document features. Now he just needs to collect relevance labels for the documents; then he can learn a ranker a la LTR.

ElD0C 5 years ago

(2015)

magma17 5 years ago

relevance==frequency?

anything is ML now...