December 12, 2025 · 10 min read

Ranking Today's AI Literature Search Engines: A Data-Driven Comparison

A methodical assessment of how five leading AI search engines perform on a standardized research question.

At monfse.com, we routinely evaluate emerging research tools as part of our product strategy – particularly when deciding which search engines to integrate or prioritize. Over time, we developed an efficient internal benchmarking process that we find informative. Today, we're sharing that process publicly, along with the results from our most recent comparison.

AI Literature Search Engines Comparison

This analysis focuses solely on search quality, not UI / UX, summaries, writing features, or downstream functionality.

Why We Conducted This Comparison

AI literature search tools are becoming increasingly popular. Many offer overlapping features, while others cater to different stages of the research lifecycle. Despite this growing ecosystem, there remains very little structured, transparent benchmarking of the core task researchers care about most: How well does the tool find relevant, high-quality academic papers?

Our goal was not to evaluate technical architectures or LLM design decisions – that work is covered extensively by others like Aaron Tay. Instead, we wanted to apply a simple, consistent, and replicable process to compare outputs across tools.

Tools Included in the Evaluation

We assessed five of the most widely used AI literature search engines:

  • Consensus
  • Elicit
  • Scholar Labs
  • SciSpace
  • Undermind

These tools differ significantly in scope. Some are pure search engines (e.g., Scholar Labs). Others include broad writing and workflow features (e.g., SciSpace). For this study, we evaluated only the search results themselves.

Methodology

Our methodology was intentionally simple:

  1. Use a single research question: "What are the GDP impacts of climate change?"
  2. Enter the same query into each tool.
  3. Extract the top 10 results from each engine.
  4. Standardize the metadata (title, abstract, authors, year, journal, citation counts).
  5. Remove tool-specific summaries and replace them with manually collected abstracts.
  6. Evaluate each set of 10 papers along three dimensions:
    • Relevance
    • Scholarly quality (citations, peer-reviewed status)
    • Recency
  7. Run the anonymized results through three general-purpose LLMs (ChatGPT, Claude, Gemini) and ask each model to rank the sets independently (see the sketch after this list).
  8. Review the outputs ourselves and provide a human ranking.
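
To make step 7 concrete, here is a minimal sketch of how an anonymized result set can be packaged into a single ranking prompt. The schema, set labels, and rubric wording below are illustrative assumptions rather than our exact internal script, and the actual calls to ChatGPT, Claude, and Gemini go through each provider's own client library.

```python
import json

# Illustrative only: each engine's cleaned top-10 list, anonymized as "Set A", "Set B", ...
# The field names reflect our standardized schema, not any tool's native export.
anonymized_sets = {
    "Set A": [
        {"title": "Example paper 1", "abstract": "...", "year": 2020,
         "journal": "Example Journal", "citations": 1500},
    ],
    "Set B": [
        {"title": "Example paper 2", "abstract": "...", "year": 2013,
         "journal": "Another Journal", "citations": 300},
    ],
}

RUBRIC = (
    "Rank the following result sets from best to worst for the research question "
    "'What are the GDP impacts of climate change?'. Judge each set on relevance, "
    "scholarly quality (citations, peer-review status), and recency. Return a ranked "
    "list of set labels with a one-sentence justification for each."
)

def build_ranking_prompt(sets: dict) -> str:
    """Serialize the anonymized sets into one prompt shared across all three models."""
    return f"{RUBRIC}\n\nResult sets:\n{json.dumps(sets, indent=2)}"

# The identical prompt is then sent to ChatGPT, Claude, and Gemini via each
# provider's own client library, and the returned rankings are compared by hand.
print(build_ranking_prompt(anonymized_sets)[:400])
```

Keeping the rubric and the anonymized labels identical across models is what makes the three independent rankings comparable.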

Although simple, the process required extensive manual cleaning. Many tools lack consistent metadata fields, leading to hours of reconciliation to ensure a fair comparison.
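
To give a sense of what that reconciliation involves, the sketch below maps a raw export row onto a common schema and flags duplicates. The field names and the DOI-based duplicate check are assumptions for illustration; none of the tools listed here exports exactly these keys.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Paper:
    """Common schema we compared across engines; the field set is our own choice."""
    title: str
    abstract: str
    authors: list[str]
    year: Optional[int]
    journal: Optional[str]
    citations: Optional[int]
    doi: Optional[str] = None

def normalize_record(raw: dict) -> Paper:
    """Map one raw export row onto the common schema.

    The alternative key names checked here are hypothetical; in practice each
    tool's export needs its own small mapping, which is where the manual
    reconciliation time went.
    """
    return Paper(
        title=(raw.get("title") or raw.get("paper_title") or "").strip(),
        abstract=(raw.get("abstract") or "").strip(),
        authors=raw.get("authors") or [],
        year=raw.get("year") or raw.get("publication_year"),
        journal=raw.get("journal") or raw.get("venue"),
        citations=raw.get("citations") or raw.get("citation_count"),
        doi=(raw.get("doi") or "").lower() or None,
    )

def deduplicate(papers: list[Paper]) -> list[Paper]:
    """Drop duplicates within one engine's top 10, keyed on DOI where available."""
    seen, unique = set(), []
    for p in papers:
        key = p.doi or p.title.lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(p)
    return unique
```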

Why We Chose a Simple Approach

Some existing "white papers" in this area use complex multi-query scoring systems to arrive at comparative rankings. However, greater complexity also creates more room for unintentional bias – or, in some cases, deliberate marketing. To illustrate: one well-known (unnamed) example claims that only 2 of 10 Google Scholar results were relevant to the topic being searched – a conclusion we found far-fetched, and one whose methodology was difficult to interpret.

Methodology Comparison

By keeping the process simple and transparent – one query, top 10 results – we avoided methodological adjustments that could advantage one tool over another.

Limitations

  • Only top 10 results per tool were analyzed.
  • Not designed to be a peer-reviewed benchmark.
  • Some tools may perform better on niche or long-tail topics.
  • We relied on LLM judgments for part of the ranking process, which introduces model-based preferences.

Nonetheless, the consistency of the rankings across all three models – and their agreement with our own assessment – suggests the results contain meaningful information.

Results

The ranking was remarkably consistent:

Search Engine Rankings

1. Undermind – The Clear Leader

Across every evaluation dimension, Undermind surfaced:

  • The most highly cited papers
  • The most recent papers
  • The most methodologically relevant papers
  • The strongest direct alignment with the research question

Average citation count exceeded 1,000, and the average publication year was 2018.

2. Consensus

Performed well on both relevance and recency, and was ranked second by all three LLMs as well as in our own ranking.

3. Scholar Labs

Solid results, but with more working papers and less methodological diversity.

4. Elicit

Frequently returned older papers and more bottom-up modeling approaches (CGE/IAM), but with high topical relevance. In our own human ranking, we placed Elicit third, ahead of Scholar Labs.

5. SciSpace

Showed several concerning patterns:

  • A retracted paper surfaced in the top 10
  • Duplicate items appeared in the same query
  • Several foundational climate-economics papers were missing

These issues produced the weakest overall performance.

Methodological Differences Matter

One unexpected finding was the degree to which each engine's implicit methodological preferences shaped the quantitative estimates it surfaced.

  • Elicit tended to return CGE/IAM models → lower GDP impact estimates
  • Undermind tended to return econometric studies → 3-5× higher estimated impacts

This is not a judgment about which approach is "better." Instead, it calls out the importance of:

  • Understanding what kinds of papers an engine is implicitly optimizing for
  • Using multiple engines when conducting policy-relevant or systematic research

Researchers relying on a single search engine may unintentionally inherit the methodological biases of that system.

Main Empirical Finding

Based on the quantitative estimates across all engines (using one estimate per paper and excluding extreme or scenario-specific outliers), the median projected global GDP loss from climate change is approximately 3-4% by 2100.
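
For readers who want to sanity-check that figure, the computation itself is nothing more than a median over one estimate per paper after trimming extreme values. The numbers below are placeholders, not our actual dataset:

```python
import statistics

# Hypothetical per-paper estimates of global GDP loss by 2100, in percent.
# One estimate per paper; the real values come from the cleaned comparison sheet.
estimates = [1.5, 2.0, 2.8, 3.1, 3.4, 3.9, 4.2, 4.8, 7.5, 23.0]

def trimmed_median(values, low_frac=0.1, high_frac=0.1):
    """Median after dropping the most extreme low/high fractions (outlier trim)."""
    ordered = sorted(values)
    lo = int(len(ordered) * low_frac)
    hi = len(ordered) - int(len(ordered) * high_frac)
    return statistics.median(ordered[lo:hi])

print(f"Median projected GDP loss: {trimmed_median(estimates):.1f}%")
```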

Why This Matters for monfse.com

monfse.com is not a search engine. Our goal is not to replace Google Scholar, Undermind, Consensus, or any other discovery tool. Instead, we are building the workflow layer that helps researchers:

  • Add papers from any search engine or database
  • Clean and normalize metadata
  • Extract key information
  • Organize, annotate, and synthesize in a cohesive workflow
  • Collaborate across teams
  • Produce transparent, rigorous reviews

During our demonstration, we imported the top 10 results from Undermind directly into monfse.com, enriched them with DOI-based metadata, extracted main findings, and retrieved full-text articles. This is precisely the type of workflow that monfse.com is designed to support – particularly for large reviews where papers come from a mix of tools, databases, and exports.
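
DOI-based enrichment of this kind does not depend on anything proprietary. As a generic illustration (this queries the public Crossref REST API and is not monfse.com's internal implementation), metadata for a known DOI can be pulled like this:

```python
import requests

def fetch_crossref_metadata(doi: str) -> dict:
    """Fetch bibliographic metadata for a DOI from the public Crossref REST API."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    resp.raise_for_status()
    msg = resp.json()["message"]
    return {
        "title": (msg.get("title") or [""])[0],
        "journal": (msg.get("container-title") or [""])[0],
        "year": msg.get("issued", {}).get("date-parts", [[None]])[0][0],
        "authors": [
            f"{a.get('given', '')} {a.get('family', '')}".strip()
            for a in msg.get("author", [])
        ],
        "cited_by": msg.get("is-referenced-by-count"),
    }

# Usage (replace with a DOI from the imported result set):
# print(fetch_crossref_metadata("10.xxxx/..."))
```

From there, fields such as citation counts and publication years can be cross-checked against what each search engine reported.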

monfse.com Workflow

Takeaways

  • Undermind currently offers the strongest search performance for this topic.
  • Consensus, Scholar Labs, and Elicit are competitive but not equivalent.
  • SciSpace returns some useful results but shows notable limitations.
  • Search engines differ sharply in the types of studies they surface.
  • Using multiple search engines is essential for serious research.
  • monfse.com acts as the consolidation, enrichment, and synthesis layer for these tools.

If you'd like access to the monfse.com library or the anonymized comparison document for this analysis, feel free to reach out to john@monfse.com.

Try monfse.com free here: /app