Our product suite currently falls into one of three categories

  • Search
  • Document content extraction (including images)
  • Document management

At the level of individual paper metadata we retrieve, our database is based on Semantic Scholar’s fantastic API. Any references to SS_ID inside of ScholarAI correspond to Semantic Scholar’s paperId field

In the category of search, we’ve created a vector database representing 100s of millions of papers that can be semantically searched by their titles and abstracts.

These search results are also enhanced by

  • Open source access information
  • Ranking based on semantic similarity to the initial query, citation counts, or publication dates
  • Optional generative_mode for directly asking questions to the searched content.

The search returns a variable number of results depending on the cosine similarity of the found results. You may see anything between 0 to 25 results for a semantic search.

Suggested Endpoints

Content Extraction

Internally, we use a neural-network powered OCR system to read PDFs for which we’ve been provided a manually uploaded file or URL. With this OCR system, we can read PDFs even if they are older scans and isolate images.

Currently, these extractions can still be blocked by publishers and are subject to copyright and open-access rules. For any file manually uploaded to our system, following expectations regarding copyright is the responsibility of the user.

The OCR system can effectively recognize text and related figures from a paper, but has a known weakness in accurately identifying figures by their label. We prune images from the returned data that seem poor quality.

Suggested Endpoints

Document management

Every user can create projects, which are storage spaces for related agentic work and documents. For the sake of this API, these projects can collect papers and then allow mass RAG query over the uploaded data.

The projects are additionally

  • Privately encrypted for each user
  • Interfaced with document store features
  • Capable of batch analysis over all or select documents inside of the project

Presently, projects only store PDFs.

Putting it all together

All systems here rely on one central ID: the paper_id. This is an abstracted way to identify a paper by DOI, it’s URL, it’s identifier inside of a project, etc. A paper_id will always begin with something akin to DOI:..., PROJ:..., SSID:.... Every endpoint, whether searching or scanning a project, uses a given paper_id to find relevant metadata which can then extend into other functions.

For example, when you search for papers, we will try to create a PDF_URL:... paper_id for every paper returned by the query. A PDF_URL:... can then be fed into a project endpoint or RAG endpoint for further extraction.