                         Size           Loss           Hybrid
Index time               0.0186 s       0.0921 s       1.0793 s
Retrieval time (abs.)    1.3925e-07 s   1.304e-07 s    1.659e-07 s

Table 4: Performance of the K-means tree with each termination strategy
5 Discussion
In conclusion, we proposed a method for faster similarity search over large corpus embeddings at the cost of
some accuracy. The trade-off between speed and accuracy depends on the specific hyper-parameter choices,
allowing flexibility in tuning the model to the requirements of the application.
In the future, we will consider running a more complete benchmark over a larger dataset, eliminating the bias
introduced by the choice of programming language and implementation.
We will also explore further optimizations in both algorithms and hyperparameters. For example, what if we
included some supervised data, learned a supervised clustering strategy first, and used the trained divider to
build the tree? Or is it possible to let two parent nodes point to one child node, making the structure a graph
instead of a tree? These are all possible avenues for improving accuracy.
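As a rough illustration of the supervised-divider idea, the sketch below fits a classifier on labeled embeddings
and routes points to children according to its predictions, in place of the unsupervised K-means split at a node.
The function name build_supervised_node and the choice of logistic regression as the divider are our own
illustrative assumptions, not part of the proposed method.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def build_supervised_node(X, y):
        # Fit a supervised divider on labeled embeddings, then route every
        # point to a child according to the divider's prediction. This is
        # an illustrative sketch, not the paper's implementation.
        divider = LogisticRegression(max_iter=1000).fit(X, y)
        routes = divider.predict(X)
        children = {c: X[routes == c] for c in np.unique(routes)}
        return divider, children  # recurse on each child to grow the tree

    # Toy usage: 1000 random 64-d "embeddings" with 4 supervised labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 64))
    y = rng.integers(0, 4, size=1000)
    divider, children = build_supervised_node(X, y)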
Moreover, we have not fully explored the possibility of dimensionality reduction. t-SNE yields a nicely separable
result in 2-D space, but the cluster shapes are inconsistent. In that setting the distance-based K-means does not
function well, whereas an SVM can still perform well, since it divides the space along the gaps between clusters.
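A minimal sketch of this combination, assuming scikit-learn: the K-means labels below exist only to give the
SVM targets to separate, so this illustrates the space-dividing idea rather than our actual pipeline.

    import numpy as np
    from sklearn.manifold import TSNE
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 64))  # stand-in for corpus embeddings

    # Project to 2-D; t-SNE tends to separate clusters but distorts shapes.
    X2d = TSNE(n_components=2, random_state=0).fit_transform(X)

    # Cluster assignments used only as training targets for the divider.
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X2d)

    # A linear SVM splits the plane along the gaps between clusters
    # instead of relying on centroid distances.
    svm = SVC(kernel="linear").fit(X2d, labels)
    print(svm.predict(X2d[:5]))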
In summary, we have identified a feasible direction for solving the similarity search problem, but numerous
intriguing avenues remain to be explored, and we will continue advancing our approach toward faster and more
accurate similarity search over large corpus embeddings.
References
[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers
for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, 2019, pages 4171–4186.
doi: 10.18653/v1/N19-1423. URL: https://www.aclweb.org/anthology/N19-1423/.
[2] H. Jégou, M. Douze, J. Johnson, L. Hosseini, and C. Deng, “Faiss: Similarity search and clustering of dense
vectors library,” Astrophysics Source Code Library, ascl–2210, 2022.
[3] J. Guo, Y. Fan, Q. Ai, and W. B. Croft, “A deep relevance matching model for ad-hoc retrieval,” in Proceedings
of the 25th ACM International on Conference on Information and Knowledge Management, 2016, pages 55–64.
[4] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information Processing
& Management, vol. 24, no. 5, pages 513–523, 1988.
[5] S. Robertson, H. Zaragoza, et al., “The probabilistic relevance framework: BM25 and beyond,” Foundations
and Trends® in Information Retrieval, vol. 3, no. 4, pages 333–389, 2009.
[6] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research,
vol. 9, no. Nov, pages 2579–2605, 2008.
[7] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih, “Dense Passage
Retrieval for Open-Domain Question Answering,” 2020. eprint: arXiv:2004.04906.
[8] F. Hamborg, N. Meuschke, C. Breitinger, and B. Gipp, “news-please: A Generic News Crawler and
Extractor,” in Proceedings of the 15th International Symposium of Information Science, Berlin, March 2017,
pages 218–223. doi: 10.5281/zenodo.4120316.
[9] J. Nievergelt, H. Hinterberger, and K. C. Sevcik, “The Grid File: An Adaptable, Symmetric Multikey
File Structure,” ACM Trans. Database Syst., vol. 9, no. 1, pages 38–71, March 1984. doi: 10.1145/348.318586.
URL: https://doi.org/10.1145/348.318586.