Cursor for learning.

An AI-native document reader that answers questions with page citations, turns highlights into flashcards, and connects you to two million arXiv papers.

No credit card required · Works with any PDF or EPUB

Attention Is All You Need — Vaswani et al.
Page 3

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.

The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

We call our particular attention "Scaled Dot-Product Attention". The input consists of queries and keys of dimension d_k, and values of dimension d_v.

Figure 2: Scaled Dot-Product Attention (Q, K → MatMul → Scale → Softmax → Output)
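The attention computation described above can be sketched in a few lines of NumPy. This is an illustrative sketch of the scaled dot-product formula from the excerpt (softmax of QKᵀ/√d_k applied to V), not code from the paper; the matrix shapes are chosen arbitrarily for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # compatibility of each query with each key
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # weighted sum of the values

# tiny example: 2 queries, 3 key-value pairs, d_k = d_v = 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (2, 4): one d_v-dimensional output per query
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the value vectors, weighted by query-key compatibility.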
Chat
How does attention compute the output?
Rorobot
The output is computed as a weighted sum of the values. The weight for each value is determined by a compatibility function between the query and its corresponding key. (p. 3)
Create a flashcard from the highlighted text
Rorobot
Flashcard created
Q: How is attention output computed?
A: Weighted sum of values, weighted by query-key compatibility.
Ask about this document...

Built for people who read to learn

Researchers · PhD students · Grad students · Self-learners · Academics

Ask anything. Get cited answers.

No more re-reading to find that one passage. Ask a question in plain English, get an answer grounded in your document—with the exact page number.

Try it free
Chat
What makes self-attention different from traditional attention?
Rorobot
Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. (p. 2)

Turn reading into remembering

One highlight becomes a flashcard. One paper becomes a quiz. Your reading stops being a one-time event and becomes lasting knowledge.

Try it free
Flashcards
Flashcard
Q: How is attention output computed?
A: As a weighted sum of the values, where each weight is computed by a compatibility function of the query with the corresponding key.
Source: p. 3 · Due: Tomorrow
Again
Hard
Good
Easy

Two million papers, one click away

Search arXiv without leaving the app. Find a paper, open it, and start reading with full AI tools—no downloading, no tab-switching, no friction.

Try it free
arXiv Search
transformer attention mechanism
Attention Is All You Need
Vaswani et al. · 2017
BERT: Pre-training of Deep Bidirectional Transformers
Devlin et al. · 2018
An Image is Worth 16x16 Words
Dosovitskiy et al. · 2020

How it works

Three steps to deeper reading

01

Drop your document

Upload a PDF, EPUB, or open any paper from arXiv. AI indexes it in the background while you start reading.

02

Read with superpowers

Highlight passages, ask questions about any section, and get answers that cite exact page numbers.

03

Actually retain it

Your highlights become flashcards. Your papers become quizzes. Everything lives in one searchable knowledge hub.

Turn complexity into clarity.

Free forever. No credit card required.