Cursor for learning.

An AI-native document reader that answers questions with page citations, turns highlights into flashcards, and connects you to two million arXiv papers.

No credit card required · Works with any PDF or EPUB

Attention Is All You Need — Vaswani et al.
Page 3

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.

The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

We call our particular attention "Scaled Dot-Product Attention". The input consists of queries and keys of dimension d_k, and values of dimension d_v.

Figure 2: Scaled Dot-Product Attention (Q, K → MatMul → Scale → Softmax → Output)
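The attention computation described above can be sketched in a few lines of NumPy. This is an illustrative sketch of the scaled dot-product formula from the excerpt (softmax of QKᵀ/√d_k applied to V), not code from the paper; the matrix shapes are chosen arbitrarily for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # compatibility of each query with each key
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # weighted sum of the values

# tiny example: 2 queries, 3 key-value pairs, d_k = d_v = 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (2, 4): one d_v-dimensional output per query
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the value vectors, weighted by query-key compatibility.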
Chat
How does attention compute the output?
Rorobot
The output is computed as a weighted sum of the values. The weight for each value is determined by a compatibility function between the query and its corresponding key. (p. 3)
Create a flashcard from the highlighted text
Rorobot
Flashcard created
Q: How is attention output computed?
A: Weighted sum of values, weighted by query-key compatibility.
Ask about this document...

Built for people who read to learn

Researchers · PhD students · Grad students · Self-learners · Academics

Ask anything. Get cited answers.

No more re-reading to find that one passage. Ask a question in plain English, get an answer grounded in your document—with the exact page number.

Try it free
Chat
What makes self-attention different from traditional attention?
Rorobot
Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. (p. 2)

Turn reading into remembering

One highlight becomes a flashcard. One paper becomes a quiz. Your reading stops being a one-time event and becomes lasting knowledge.

Try it free
Flashcards
Flashcard
Q: How is attention output computed?
A: As a weighted sum of the values, where each weight is computed by a compatibility function of the query with the corresponding key.
Source: p. 3 · Due: Tomorrow
Again
Hard
Good
Easy

Two million papers, one click away

Search arXiv without leaving the app. Find a paper, open it, and start reading with full AI tools—no downloading, no tab-switching, no friction.

Try it free
arXiv Search
transformer attention mechanism
Attention Is All You Need
Vaswani et al. · 2017
BERT: Pre-training of Deep Bidirectional Transformers
Devlin et al. · 2018
An Image is Worth 16x16 Words
Dosovitskiy et al. · 2020

How it works

Three steps to deeper reading

01

Drop your document

Upload a PDF, EPUB, or open any paper from arXiv. AI indexes it in the background while you start reading.

02

Read with superpowers

Highlight passages, ask questions about any section, and get answers that cite exact page numbers.

03

Actually retain it

Your highlights become flashcards. Your papers become quizzes. Everything lives in one searchable knowledge hub.

Turn complexity into clarity.

Free forever. No credit card required.