# Scan and count
Scan reads every item in a DynamoDB table. Use it when you need all items or don't know the partition key.
> **Tip:** For large result sets, you might want to use `as_dict=True`. See the *Return dicts instead of models* section below.
## Key features
- Scan all items in a table
- Filter results by any attribute
- Count items without returning them
- Parallel scan for large tables (4-8x faster)
- Automatic pagination
- Async support
## Getting started

### Basic scan

Use `Model.scan()` to read all items:

```python
"""Basic scan example - scan all items in a table."""
from pydynox import DynamoDBClient, Model, ModelConfig
from pydynox.attributes import NumberAttribute, StringAttribute

client = DynamoDBClient()


class User(Model):
    model_config = ModelConfig(table="users", client=client)

    pk = StringAttribute(hash_key=True)
    name = StringAttribute()
    age = NumberAttribute()


# Scan all users
for user in User.scan():
    print(f"{user.name} is {user.age} years old")
```
The scan returns a `ModelScanResult` that you can:

- Iterate with a `for` loop
- Get the first result with `.first()`
- Collect all items with `list()`
### Filter conditions

Filter results by any attribute:

```python
"""Scan with filter condition."""
from pydynox import DynamoDBClient, Model, ModelConfig
from pydynox.attributes import NumberAttribute, StringAttribute

client = DynamoDBClient()


class User(Model):
    model_config = ModelConfig(table="users", client=client)

    pk = StringAttribute(hash_key=True)
    name = StringAttribute()
    age = NumberAttribute()
    status = StringAttribute()


# Filter by status
for user in User.scan(filter_condition=User.status == "active"):
    print(f"Active user: {user.name}")

# Filter by age
for user in User.scan(filter_condition=User.age >= 18):
    print(f"Adult: {user.name}")

# Complex filter
for user in User.scan(filter_condition=(User.status == "active") & (User.age >= 21)):
    print(f"Active adult: {user.name}")
```
> **Warning:** Filters run after DynamoDB reads the items. You still pay for reading all items, even if the filter returns fewer.
### Get first result

Get the first matching item:

```python
"""Get first result from scan."""
from pydynox import DynamoDBClient, Model, ModelConfig
from pydynox.attributes import NumberAttribute, StringAttribute

client = DynamoDBClient()


class User(Model):
    model_config = ModelConfig(table="users", client=client)

    pk = StringAttribute(hash_key=True)
    name = StringAttribute()
    age = NumberAttribute()


# Get first user (any user)
user = User.scan().first()
if user:
    print(f"Found: {user.name}")
else:
    print("No users found")

# Get first user matching filter
admin = User.scan(filter_condition=User.name == "admin").first()
if admin:
    print(f"Admin found: {admin.pk}")
```
### Count items

Count items without returning them:

```python
"""Count items in a table."""
from pydynox import DynamoDBClient, Model, ModelConfig
from pydynox.attributes import NumberAttribute, StringAttribute

client = DynamoDBClient()


class User(Model):
    model_config = ModelConfig(table="users", client=client)

    pk = StringAttribute(hash_key=True)
    name = StringAttribute()
    age = NumberAttribute()
    status = StringAttribute()


# Count all users
count, metrics = User.count()
print(f"Total users: {count}")
print(f"Duration: {metrics.duration_ms:.2f}ms")
print(f"RCU consumed: {metrics.consumed_rcu}")

# Count with filter
active_count, _ = User.count(filter_condition=User.status == "active")
print(f"Active users: {active_count}")

# Count adults
adult_count, _ = User.count(filter_condition=User.age >= 18)
print(f"Adults: {adult_count}")
```
> **Note:** Count still scans the entire table. It just doesn't return the items.
## Advanced

### Why scan is expensive
DynamoDB charges by read capacity units (RCU). Scan reads every item, so you pay for the entire table.
| Table size | Items | RCU (eventually consistent) | RCU (strongly consistent) |
|---|---|---|---|
| 100 MB | 10,000 | ~12,800 | ~25,600 |
| 1 GB | 100,000 | ~131,000 | ~262,000 |
| 10 GB | 1,000,000 | ~1,300,000 | ~2,600,000 |

Formula:

- Strongly consistent: 1 RCU per 4 KB read
- Eventually consistent: 0.5 RCU per 4 KB read (half the cost)
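As a back-of-the-envelope check, scan cost can be estimated from DynamoDB's published read pricing (1 RCU per 4 KB for strongly consistent reads, half that for eventually consistent). A standalone sketch of the arithmetic, independent of any pydynox API:

```python
def scan_rcu_estimate(table_size_kb: float, consistent: bool = False) -> float:
    """Rough RCU cost of scanning a table of the given size.

    Strongly consistent reads cost 1 RCU per 4 KB;
    eventually consistent reads cost half that.
    """
    rcu_per_4kb = 1.0 if consistent else 0.5
    return (table_size_kb / 4) * rcu_per_4kb


# 100 MB table
print(scan_rcu_estimate(100 * 1024))                    # 12800.0
print(scan_rcu_estimate(100 * 1024, consistent=True))   # 25600.0
```

Actual consumption varies a little because DynamoDB rounds each item up to the next 4 KB boundary, but for scan-cost budgeting this estimate is close enough.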
### Parallel scan
For large tables, split the scan across multiple segments to speed it up. Parallel scan runs all segments concurrently using tokio in Rust.
Performance: 4 segments = ~4x faster, 8 segments = ~8x faster. RCU cost is the same (you're reading the same data).
"""Parallel scan example - scan large tables fast."""
from pydynox import Model, ModelConfig
from pydynox.attributes import NumberAttribute, StringAttribute
class User(Model):
"""User model."""
model_config = ModelConfig(table="users")
pk = StringAttribute(hash_key=True)
name = StringAttribute()
age = NumberAttribute()
status = StringAttribute()
# Parallel scan with 4 segments - much faster for large tables
users, metrics = User.parallel_scan(total_segments=4)
print(f"Found {len(users)} users in {metrics.duration_ms:.2f}ms")
# With filter
active_users, metrics = User.parallel_scan(
total_segments=4, filter_condition=User.status == "active"
)
print(f"Found {len(active_users)} active users")
Async version:

```python
"""Async parallel scan example."""
import asyncio

from pydynox import Model, ModelConfig
from pydynox.attributes import NumberAttribute, StringAttribute


class User(Model):
    """User model."""

    model_config = ModelConfig(table="users")

    pk = StringAttribute(hash_key=True)
    name = StringAttribute()
    age = NumberAttribute()
    status = StringAttribute()


async def main():
    """Async parallel scan."""
    # Parallel scan with 4 segments
    users, metrics = await User.async_parallel_scan(total_segments=4)
    print(f"Found {len(users)} users in {metrics.duration_ms:.2f}ms")

    # With filter
    active_users, metrics = await User.async_parallel_scan(
        total_segments=4, filter_condition=User.status == "active"
    )
    print(f"Found {len(active_users)} active users")


asyncio.run(main())
```
#### How many segments?
- Small tables (< 100K items): 1-2 segments
- Medium tables (100K - 1M items): 4-8 segments
- Large tables (> 1M items): 8-16 segments
Experiment to find what works best for your table size.
> **Important:** Parallel scan returns all items at once (not paginated). For very large tables that don't fit in memory, use regular `scan()` with the `segment` and `total_segments` parameters to stream one segment at a time.
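A sketch of that streaming pattern, using the `segment` and `total_segments` parameters from the scan parameter table below (it assumes the `User` model from the examples above; this is an illustration, not an official pydynox recipe):

```python
# Stream a large table one segment at a time. Each scan() call paginates
# automatically, so only one page of items is held in memory at once.
TOTAL_SEGMENTS = 4

for segment in range(TOTAL_SEGMENTS):
    for user in User.scan(segment=segment, total_segments=TOTAL_SEGMENTS):
        print(user.name)
```

Unlike `parallel_scan()`, the segments here run one after another, trading speed for a bounded memory footprint.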
### Async scan

Use `async_scan()` and `async_count()` for async code:
"""Async scan example."""
import asyncio
from pydynox import DynamoDBClient, Model, ModelConfig
from pydynox.attributes import NumberAttribute, StringAttribute
client = DynamoDBClient()
class User(Model):
model_config = ModelConfig(table="users", client=client)
pk = StringAttribute(hash_key=True)
name = StringAttribute()
age = NumberAttribute()
status = StringAttribute()
async def scan_all_users() -> None:
"""Scan all users asynchronously."""
async for user in User.async_scan():
print(f"{user.name}")
async def scan_active_users() -> None:
"""Scan with filter asynchronously."""
async for user in User.async_scan(filter_condition=User.status == "active"):
print(f"Active: {user.name}")
async def count_users() -> None:
"""Count users asynchronously."""
count, metrics = await User.async_count()
print(f"Total: {count}, Duration: {metrics.duration_ms:.2f}ms")
async def main() -> None:
await scan_all_users()
await scan_active_users()
await count_users()
if __name__ == "__main__":
asyncio.run(main())
Async parallel scan works the same way with `async_parallel_scan()` — see the async example in the Parallel scan section above.
### Pagination

By default, the iterator fetches all pages automatically. For manual control:

```python
result = User.scan(limit=100)
users = list(result)

# Get the last key for the next page
last_key = result.last_evaluated_key
if last_key:
    next_result = User.scan(limit=100, last_evaluated_key=last_key)
```
### Consistent reads

Pass `consistent_read=True` for strongly consistent reads (at twice the RCU cost).
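A minimal sketch, assuming the `User` model from the earlier examples (`consistent_read` is a documented parameter of both `scan()` and `count()`):

```python
# Strongly consistent scan - guarantees the latest data,
# at twice the RCU cost of the default eventually consistent scan.
for user in User.scan(consistent_read=True):
    print(user.name)

# count() accepts the same parameter
count, _ = User.count(consistent_read=True)
```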
### Metrics

Access scan metrics after iteration:

```python
result = User.scan()
users = list(result)

print(f"Duration: {result.metrics.duration_ms}ms")
print(f"Items returned: {result.metrics.items_count}")
print(f"Items scanned: {result.metrics.scanned_count}")
print(f"RCU consumed: {result.metrics.consumed_rcu}")
```
### Return dicts instead of models
By default, scan returns Model instances. Each item from DynamoDB is converted to a Python object with all the Model methods and hooks.
This conversion has a cost. Python object creation is slow compared to Rust. For scans that return many items (hundreds or thousands), this becomes a bottleneck.
Use `as_dict=True` to skip Model instantiation and get plain dicts:

```python
"""Scan returning dicts instead of Model instances."""
from pydynox import Model, ModelConfig
from pydynox.attributes import NumberAttribute, StringAttribute


class User(Model):
    model_config = ModelConfig(table="users")

    pk = StringAttribute(hash_key=True)
    sk = StringAttribute(range_key=True)
    name = StringAttribute(null=True)
    age = NumberAttribute(null=True)


# Return dicts instead of Model instances
for user in User.scan(as_dict=True):
    # user is a plain dict, not a User instance
    print(user.get("pk"), user.get("name"))

# Parallel scan with as_dict
users, metrics = User.parallel_scan(total_segments=4, as_dict=True)
print(f"Found {len(users)} users as dicts")
```
When to use `as_dict=True`:

- Read-only operations where you don't need `.save()`, `.delete()`, or hooks
- Scans returning many items (100+)
- Performance-critical code paths
- Data export or migration scripts
Trade-offs:

| | Model instances | `as_dict=True` |
|---|---|---|
| Speed | Slower (Python object creation) | Faster (plain dicts) |
| Methods | `.save()`, `.delete()`, `.update()` | None |
| Hooks | `after_load` runs | No hooks |
| Type hints | Full IDE support | Dict access |
| Validation | Attribute types enforced | Raw DynamoDB types |
#### Why this happens

This is how Python works: creating class instances is expensive. Rust handles the DynamoDB call and deserialization quickly, but Python must still create each Model object. There's no way around this in Python itself.
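The overhead is easy to see with plain Python. A standalone micro-benchmark (no pydynox involved) comparing dict creation with instantiating a minimal stand-in class:

```python
import timeit


class Item:
    """Minimal stand-in for a Model instance."""

    def __init__(self, pk: str, name: str, age: int) -> None:
        self.pk = pk
        self.name = name
        self.age = age


raw = {"pk": "user#1", "name": "Alice", "age": 30}

dict_time = timeit.timeit(lambda: dict(raw), number=100_000)
obj_time = timeit.timeit(lambda: Item(**raw), number=100_000)

print(f"dict copy:       {dict_time:.3f}s")
print(f"class instance:  {obj_time:.3f}s")
```

Real Model instantiation also runs attribute validation and `after_load` hooks, so in practice the gap is larger than this bare comparison suggests.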
## Scan parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `filter_condition` | `Condition` | `None` | Filter on any attribute |
| `limit` | `int` | `None` | Items per page |
| `consistent_read` | `bool` | `None` | Strongly consistent read |
| `last_evaluated_key` | `dict` | `None` | Start key for pagination |
| `segment` | `int` | `None` | Segment number for parallel scan |
| `total_segments` | `int` | `None` | Total segments for parallel scan |
| `as_dict` | `bool` | `False` | Return dicts instead of Model instances |
## Parallel scan parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `total_segments` | `int` | Required | Number of parallel segments |
| `filter_condition` | `Condition` | `None` | Filter on any attribute |
| `consistent_read` | `bool` | `None` | Strongly consistent read |
| `as_dict` | `bool` | `False` | Return dicts instead of Model instances |
## Count parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `filter_condition` | `Condition` | `None` | Filter on any attribute |
| `consistent_read` | `bool` | `None` | Strongly consistent read |
## Anti-patterns

### Scan in API endpoints

```python
# Bad: slow and expensive on every request
@app.get("/users")
def list_users():
    return list(User.scan())
```

Use query with a GSI or pagination instead.
### Scan to find one item

```python
# Bad: scanning to find a single user by email
user = User.scan(filter_condition=User.email == "john@example.com").first()
```

Create a GSI on email and query it instead.
### Expecting filters to reduce cost

```python
# Bad: this still reads all 1 million users
active_users = list(User.scan(filter_condition=User.status == "active"))
```

Use a GSI on status or a different data model.
### Frequent count operations

```python
# Bad: counting on every page load
@app.get("/dashboard")
def dashboard():
    total_users, _ = User.count()
    return {"total": total_users}
```

Maintain a counter in a separate item or use CloudWatch metrics.
## Scan vs query

| | Scan | Query |
|---|---|---|
| Reads | Entire table | Items with the same partition key |
| Cost | High (all items) | Low (only matching items) |
| Speed | Slow on large tables | Fast |
| Use case | Export, migration, admin | User-facing, real-time |
If you can use query, use query. Only use scan when you need all items or don't know the partition key.
## Alternatives to scan
| Need | Alternative |
|---|---|
| Find by non-key attribute | Create a GSI |
| Count items | Maintain a counter item |
| Search text | Use OpenSearch or Algolia |
| List recent items | GSI with timestamp as sort key |
| Export data | DynamoDB Export to S3 |
## Next steps
- Query - Query by partition key
- Indexes - Query by non-key attributes
- Conditions - All condition operators