Feature Set - ScholarAI

Our product suite currently falls into one of three categories

Search
Document content extraction (including images)
Document management

At the level of individual paper metadata we retrieve, our database is based on Semantic Scholar’s fantastic API. Any references to SS_ID inside of ScholarAI correspond to Semantic Scholar’s paperId field

Search

In the category of search, we’ve created a vector database representing 100s of millions of papers that can be semantically searched by their titles and abstracts. These search results are also enhanced by

Open source access information
Ranking based on semantic similarity to the initial query, citation counts, or publication dates
Optional generative_mode for directly asking questions to the searched content.

The search returns a variable number of results depending on the cosine similarity of the found results. You may see anything between 0 to 25 results for a semantic search.

Suggested Endpoints

Search Papers for LLMs

Semantically search papers, optimized for agents

Search Papers in Your Code

Semantically search papers, with lots of metadata for use in your applications

Explore the literature map

Find adjacent papers

RAG over multiple papers

RAG over a large batch of papers

Content Extraction

Internally, we use a neural-network powered OCR system to read PDFs for which we’ve been provided a manually uploaded file or URL. With this OCR system, we can read PDFs even if they are older scans and isolate images. Currently, these extractions can still be blocked by publishers and are subject to copyright and open-access rules. For any file manually uploaded to our system, following expectations regarding copyright is the responsibility of the user.

The OCR system can effectively recognize text and related figures from a paper, but has a known weakness in accurately identifying figures by their label. We prune images from the returned data that seem poor quality.

Suggested Endpoints

Summarize a paper's full text

Extract paper sections, piece by piece

Ask questions to a PDF

RAG the best sections of a PDF related to a question

RAG over multiple papers

RAG over a large batch of papers

Document management

Every user can create projects, which are storage spaces for related agentic work and documents. For the sake of this API, these projects can collect papers and then allow mass RAG query over the uploaded data. The projects are additionally

Privately encrypted for each user
Interfaced with document store features
Capable of batch analysis over all or select documents inside of the project

Presently, projects only store PDFs.

Add a paper to a project

Extract paper sections, piece by piece

Ask questions to a PDF

RAG the best sections of a PDF related to a question

RAG over multiple papers

RAG over a large batch of papers

Putting it all together

All systems here rely on one central ID: the paper_id. This is an abstracted way to identify a paper by DOI, it’s URL, it’s identifier inside of a project, etc. A paper_id will always begin with something akin to DOI:..., PROJ:..., SSID:.... Every endpoint, whether searching or scanning a project, uses a given paper_id to find relevant metadata which can then extend into other functions. For example, when you search for papers, we will try to create a PDF_URL:... paper_id for every paper returned by the query. A PDF_URL:... can then be fed into a project endpoint or RAG endpoint for further extraction.

API Documentation

Endpoints

​Search

​Suggested Endpoints

Search Papers for LLMs

Search Papers in Your Code

Explore the literature map

RAG over multiple papers

​Content Extraction

​Suggested Endpoints

Summarize a paper's full text

Ask questions to a PDF

RAG over multiple papers

​Document management

Add a paper to a project

Ask questions to a PDF

RAG over multiple papers

​Putting it all together

Search

Suggested Endpoints

Content Extraction

Suggested Endpoints

Document management

Putting it all together