Lambda + S3 checkpoints

A resumable AWS Lambda crawler that survives 15-minute timeouts via S3 checkpoints.

AWS Lambda has a hard 15-minute execution limit. A crawl that needs to run longer than that has to checkpoint its progress and resume across invocations. With yoink, that's about 20 lines of code.

The architecture

Topology (examples/checkpoint_resume.py):

trigger (AWS): EventBridge rule, rate(14 minutes)
compute (15-min cap): Lambda function, python3.11, 1024 MB
app (yoink): Crawler with CheckpointManager.from_uri(s3://…)
persistent state (read/write): s3://my-crawl-bucket/checkpoints/example.jsonl
  metadata: run header
  page × N: streamed appends
  state × M: written every flush_interval

resume=True: when EventBridge fires the next invocation, the Crawler reads the checkpoint object, restores visited / queue / filtered, and continues exactly where the previous run stopped. Repeats until the crawl finishes or you disable the rule.

14-min budget per invocation · S3 stores progress · scheduled re-invocations resume until done.
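
The resume step described above can be pictured as a replay of the JSONL checkpoint object. The sketch below is not yoink's implementation: only the three record kinds (metadata, page, state) come from the diagram, while the `type`, `run_id`, `visited`, and `queue` field names and the `ResumeState` class are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResumeState:
    """Hypothetical in-memory view of a replayed checkpoint."""
    run_id: Optional[str] = None
    pages_seen: int = 0
    visited: set = field(default_factory=set)
    queue: list = field(default_factory=list)


def replay(lines):
    """Fold JSONL checkpoint lines into a ResumeState.

    Assumed record shapes (not confirmed by yoink):
      {"type": "metadata", "run_id": ...}
      {"type": "page", "url": ...}
      {"type": "state", "visited": [...], "queue": [...]}
    """
    state = ResumeState()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines between appends
        record = json.loads(raw)
        kind = record.get("type")
        if kind == "metadata":
            state.run_id = record.get("run_id")
        elif kind == "page":
            state.pages_seen += 1
        elif kind == "state":
            # A later snapshot replaces any earlier one.
            state.visited = set(record.get("visited", []))
            state.queue = list(record.get("queue", []))
    return state
```

Under these assumptions, the last state record decides where the next invocation starts, which is why a shorter flush_interval would lose less work when an invocation is cut off.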

Lambda handler

import asyncio
import json
import os
 
from yoink import Crawler, CrawlConfig, CheckpointManager
 
CHECKPOINT_BUCKET = os.environ["CHECKPOINT_BUCKET"]
CHECKPOINT_KEY    = os.environ["CHECKPOINT_KEY"]      # e.g. "crawls/example-com.jsonl"
START_URL         = os.environ["START_URL"]
MAX_PAGES         = int(os.environ.get("MAX_PAGES", "10000"))
 
# Stop at 14 minutes, leaving ~60s of the 900s Lambda timeout for shutdown
TIME_BUDGET_SECONDS = 14 * 60
 
async def crawl_chunk():
    config = CrawlConfig(
        max_depth=4,
        max_pages=MAX_PAGES,
        max_concurrency=20,
        requests_per_second=10.0,
    )
 
    checkpoint_uri = f"s3://{CHECKPOINT_BUCKET}/{CHECKPOINT_KEY}"
    checkpoint = CheckpointManager.from_uri(checkpoint_uri, flush_interval=50)
 
    crawler = Crawler(config=config, checkpoint_manager=checkpoint)
 
    # Resume picks up if checkpoint exists, else starts fresh
    pages = await asyncio.wait_for(
        crawler.crawl(START_URL, resume=True),
        timeout=TIME_BUDGET_SECONDS,
    )
    return pages
 
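def handler(event, context):
    try:
        pages = asyncio.run(crawl_chunk())
        # crawl() returned before the budget ran out, so it either reached
        # MAX_PAGES or ran out of URLs; either way nothing is left to resume.
        done = True
    except asyncio.TimeoutError:
        # Hit the time budget; the next invocation resumes from the checkpoint.
        # The page list is lost with the cancelled task, so this reports 0.
        done = False
        pages = []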
 
    return {
        "statusCode": 200,
        "body": json.dumps({
            "pages_so_far": len(pages),
            "done": done,
        }),
    }
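
The handler above uses a fixed 14-minute budget. As an alternative sketch, the budget could be derived from the invocation's own deadline through the Lambda context object's `get_remaining_time_in_millis()` method, so it keeps tracking the configured function timeout. The helper names `time_budget_seconds` and `run_with_budget` and the 60-second margin are illustrative assumptions, not part of yoink.

```python
import asyncio

SAFETY_MARGIN_SECONDS = 60  # hypothetical margin left for shutdown


def time_budget_seconds(context, margin=SAFETY_MARGIN_SECONDS):
    """Seconds available for crawling, based on the Lambda deadline."""
    remaining = context.get_remaining_time_in_millis() / 1000.0
    return max(0.0, remaining - margin)


async def run_with_budget(coro_factory, budget):
    """Await coro_factory() for at most `budget` seconds.

    Returns (result, timed_out). On timeout the task is cancelled
    and the result is None.
    """
    try:
        result = await asyncio.wait_for(coro_factory(), timeout=budget)
        return result, False
    except asyncio.TimeoutError:
        return None, True
```

In the handler this would be called as `asyncio.run(run_with_budget(crawl_chunk, time_budget_seconds(context)))`, with `crawl_chunk` no longer wrapping `crawler.crawl` in its own `wait_for`.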

Deploy

IAM role

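The Lambda execution role needs permission to write logs and to read and write the checkpoint object. HEAD requests on an object are authorized by s3:GetObject (there is no separate s3:HeadObject action), and s3:ListBucket on the bucket lets a missing checkpoint come back as 404 rather than 403:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::your-checkpoint-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::your-checkpoint-bucket"
    }
  ]
}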

Lambda layer

Bundle yoink and its [s3] extra into a layer:

mkdir -p layer/python
pip install --target layer/python "yoink[s3]"
cd layer && zip -r ../yoink-layer.zip python && cd ..
aws lambda publish-layer-version \
  --layer-name yoink \
  --zip-file fileb://yoink-layer.zip \
  --compatible-runtimes python3.11

Function

aws lambda create-function \
  --function-name yoink-crawler \
  --runtime python3.11 \
  --role arn:aws:iam::ACCOUNT:role/yoink-crawler-role \
  --handler handler.handler \
  --timeout 900 \
  --memory-size 1024 \
  --layers arn:aws:lambda:REGION:ACCOUNT:layer:yoink:1 \
  --zip-file fileb://handler.zip \
  --environment "Variables={CHECKPOINT_BUCKET=...,CHECKPOINT_KEY=crawls/example.jsonl,START_URL=https://example.com}"

Schedule

aws events put-rule \
  --name yoink-crawler-tick \
  --schedule-expression "rate(14 minutes)"
 
aws events put-targets \
  --rule yoink-crawler-tick \
  --targets "Id=1,Arn=arn:aws:lambda:REGION:ACCOUNT:function:yoink-crawler"
 
aws lambda add-permission \
  --function-name yoink-crawler \
  --statement-id allow-eventbridge \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:REGION:ACCOUNT:rule/yoink-crawler-tick

Observability

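A few things worth logging (structlog has to be bundled into the layer alongside yoink):

import structlog
log = structlog.get_logger()
 
# in handler, after done is set:
log.info("invocation_complete",
    pages_so_far=len(pages),
    done=done,
    # checkpoint_uri is local to crawl_chunk(), so rebuild it here
    checkpoint=f"s3://{CHECKPOINT_BUCKET}/{CHECKPOINT_KEY}",
)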

You can read the checkpoint file from anywhere with read access — aws s3 cp, the AWS console, or a small Lambda that loads it via CheckpointManager.from_uri(...).load().
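
For a quick look at progress without loading yoink, the JSONL object can also be tallied line by line. This is a rough sketch: the `type` field used to group records is an assumed field name, and `summarize` and `summarize_s3` are hypothetical helpers. The boto3 calls used (`get_object` and `StreamingBody.iter_lines`) are standard.

```python
import json
from collections import Counter


def summarize(lines):
    """Count checkpoint records by their 'type' field (assumed name)."""
    counts = Counter()
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            # For example, a partially written final line.
            counts["unparseable"] += 1
            continue
        if isinstance(record, dict):
            counts[record.get("type", "unknown")] += 1
        else:
            counts["unknown"] += 1
    return dict(counts)


def summarize_s3(bucket, key):
    """Stream the checkpoint object from S3 and summarize it."""
    import boto3  # imported lazily so summarize() works without AWS deps

    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    return summarize(body.iter_lines())
```

Streaming with `iter_lines()` avoids holding a large checkpoint in memory, which matters more as the page count grows.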

Stopping the schedule

When done=True, disable the EventBridge rule (or have the Lambda do it, in which case the execution role also needs events:DisableRule on the rule):

import boto3
 
if done:
    boto3.client("events").disable_rule(Name="yoink-crawler-tick")
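
One way to fold that call into the handler is a small helper that takes the EventBridge client as an argument, which keeps it testable without AWS access. This is a sketch under assumptions: `stop_schedule_if_done` is a hypothetical name, and swallowing the error is a design choice made here, not something yoink prescribes.

```python
import logging

log = logging.getLogger(__name__)


def stop_schedule_if_done(done, events_client, rule_name):
    """Disable the EventBridge rule once the crawl reports done.

    Returns True if the rule was disabled. A failed call is logged and
    swallowed so a finished crawl is not reported as a failed invocation;
    the next scheduled tick gets another chance to disable the rule.
    """
    if not done:
        return False
    try:
        events_client.disable_rule(Name=rule_name)
    except Exception:
        log.exception("could not disable rule %s", rule_name)
        return False
    return True
```

In the handler this would be called as `stop_schedule_if_done(done, boto3.client("events"), "yoink-crawler-tick")` just before returning the response.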