SQLAlchemy is the leading Python SQL toolkit and Object-Relational Mapping (ORM) framework, essential for modern data science workflows. This guide demonstrates how to leverage SQLAlchemy's powerful features in Deepnote's collaborative environment for efficient database operations and data analysis.
Why choose SQLAlchemy for data science?
Before diving into implementation, let's understand why SQLAlchemy has become the go-to choice for data professionals:
- Type safety: Strong typing support with Python type hints
- Performance: Compiled SQL statement caching (introduced in 1.4) keeps query-generation overhead low
- Flexibility: Works with multiple databases (PostgreSQL, MySQL, SQLite, etc.)
- Integration: Seamless compatibility with pandas, numpy, and other data science tools
- Modern design: Built for contemporary Python practices and async operations
Quick start with SQLAlchemy in Deepnote
# Install SQLAlchemy 2.0 or newer (quote the spec so the shell doesn't treat >= as a redirect)
!pip install "sqlalchemy>=2.0.0"
# Essential imports (DateTime, Float, String, and datetime are used by the model below)
from datetime import datetime

from sqlalchemy import create_engine, select, DateTime, Float, String
from sqlalchemy.orm import DeclarativeBase, Session, Mapped, mapped_column
# For SQLite (development); echo=True logs the generated SQL
engine = create_engine("sqlite:///analytics_db.db", echo=True)

# For PostgreSQL (production): pick one of the two, the second assignment replaces the first
engine = create_engine("postgresql://user:password@localhost:5432/dbname")
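Whichever engine you choose, it's worth confirming it can actually reach the database before going further; a minimal connectivity check looks like this:

from sqlalchemy import text

# Open a short-lived connection and run a trivial query to verify connectivity
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())  # prints 1 if the database is reachable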
Modern model definition
class Base(DeclarativeBase):
    pass

class DataPoint(Base):
    __tablename__ = "data_points"

    id: Mapped[int] = mapped_column(primary_key=True)
    timestamp: Mapped[datetime] = mapped_column(DateTime)
    value: Mapped[float] = mapped_column(Float)
    category: Mapped[str] = mapped_column(String(50))

    def __repr__(self) -> str:
        return f"DataPoint(id={self.id}, timestamp={self.timestamp}, value={self.value})"
Advanced SQLAlchemy features for data science
Efficient batch operations
def batch_insert_data(data_points: list[dict], batch_size: int = 1000):
    """Insert dictionaries as DataPoint rows in fixed-size batches."""
    with Session(engine) as session:
        for i in range(0, len(data_points), batch_size):
            batch = [DataPoint(**point) for point in data_points[i:i + batch_size]]
            session.add_all(batch)
            session.commit()  # commit per batch to keep transactions small
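For a quick smoke test of the helper, you can generate synthetic rows; the category names and value distribution below are made up for illustration:

import random
from datetime import timedelta

# Hypothetical sample data: 5,000 readings spread over the last week
sample = [
    {
        "timestamp": datetime.now() - timedelta(minutes=random.randint(0, 7 * 24 * 60)),
        "value": random.gauss(100, 25),
        "category": random.choice(["sensor_a", "sensor_b", "sensor_c"]),
    }
    for _ in range(5000)
]
batch_insert_data(sample)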
Integration with Pandas
import pandas as pd
def query_to_dataframe(query_statement) -> pd.DataFrame:
    """Run a column-level select and return the rows as a DataFrame."""
    with Session(engine) as session:
        result = session.execute(query_statement)
        # Passing columns= up front also handles the empty-result case
        return pd.DataFrame(result.fetchall(), columns=list(result.keys()))
# Example usage: select columns (not the entity) so rows map cleanly to DataFrame columns
query = select(
    DataPoint.timestamp, DataPoint.value, DataPoint.category
).where(DataPoint.value > 100)
df = query_to_dataframe(query)
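If you don't need the session at all, pandas can consume a SQLAlchemy selectable directly, which makes the helper optional:

# pd.read_sql accepts a SQLAlchemy selectable together with an engine or connection
df = pd.read_sql(query, engine)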
Advanced querying techniques
from sqlalchemy import func, and_, or_

def analyze_data_patterns():
    with Session(engine) as session:
        # Complex aggregation: per-category count, mean, and standard deviation
        # (func.stddev requires a backend that provides it, e.g. PostgreSQL)
        stats = session.execute(
            select(
                DataPoint.category,
                func.count(DataPoint.id).label('count'),
                func.avg(DataPoint.value).label('avg_value'),
                func.stddev(DataPoint.value).label('std_value')
            ).group_by(DataPoint.category)
        ).all()
        return stats
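The rows come back as named tuples, so they drop straight into pandas; a small usage sketch:

stats = analyze_data_patterns()
stats_df = pd.DataFrame(stats, columns=["category", "count", "avg_value", "std_value"])
print(stats_df)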
Best practices for production
Connection management
from contextlib import contextmanager

@contextmanager
def get_session():
    """Yield a session that commits on success and rolls back on error."""
    session = Session(engine)
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()
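Used as a context manager, the commit-or-rollback logic stays out of your analysis code:

# Any exception inside the block triggers a rollback; otherwise changes are committed
with get_session() as session:
    session.add(DataPoint(timestamp=datetime.now(), value=42.0, category="sensor_a"))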
Error handling and validation
from sqlalchemy.exc import SQLAlchemyError
from typing import Optional

def safe_query_execution(query_func) -> Optional[pd.DataFrame]:
    """Run query_func(session) and return a DataFrame, or None on a database error."""
    try:
        with get_session() as session:
            result = query_func(session)
            return pd.DataFrame(result)
    except SQLAlchemyError as e:
        print(f"Database error occurred: {e}")
        return None
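Here query_func is any callable that takes a session; a lambda wrapping a select works fine:

# Fetch high-value rows; returns None instead of raising on database errors
df = safe_query_execution(
    lambda session: session.execute(
        select(DataPoint.timestamp, DataPoint.value).where(DataPoint.value > 100)
    ).all()
)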
Real-world examples in Deepnote
Time series analysis
def analyze_time_series():
    # date_trunc is PostgreSQL-specific; see the SQLite variant below
    hour = func.date_trunc('hour', DataPoint.timestamp).label('hour')
    query = select(
        hour,
        func.avg(DataPoint.value).label('avg_value')
    ).group_by(hour).order_by(hour)
    return query_to_dataframe(query)
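On the SQLite development database, where date_trunc is unavailable, strftime gives an equivalent hourly bucket; a sketch under that assumption:

def analyze_time_series_sqlite():
    # strftime truncates each timestamp to its hour as a string bucket
    hour = func.strftime('%Y-%m-%d %H:00:00', DataPoint.timestamp).label('hour')
    query = select(
        hour,
        func.avg(DataPoint.value).label('avg_value')
    ).group_by(hour).order_by(hour)
    return query_to_dataframe(query)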
Statistical analysis
def calculate_statistics():
    # percentile_cont and stddev are PostgreSQL aggregate functions
    with get_session() as session:
        result = session.execute(
            select(
                func.percentile_cont(0.5).within_group(
                    DataPoint.value.desc()
                ).label('median'),
                func.avg(DataPoint.value).label('mean'),
                func.stddev(DataPoint.value).label('std_dev')
            )
        ).first()
        return result
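If your backend lacks these aggregates (stock SQLite, for instance), a portable fallback is to pull the column into pandas and compute the statistics client-side:

# Portable fallback: compute the same statistics in pandas
df = query_to_dataframe(select(DataPoint.value))
print(df["value"].median(), df["value"].mean(), df["value"].std())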
Performance optimization tips
- Bulk operations: Use session.bulk_insert_mappings() for large datasets (see the sketch after this list)
- Lazy loading: Configure relationship loading strategies appropriately
- Index management: Create indexes for frequently queried columns
- Connection pooling: Utilize connection pools for concurrent access, as shown in the engine configuration below
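The first tip deserves a concrete sketch: bulk_insert_mappings() takes plain dictionaries and skips ORM object construction entirely (the rows below are hypothetical):

# A minimal bulk_insert_mappings sketch: plain dicts, no ORM objects constructed
with Session(engine) as session:
    session.bulk_insert_mappings(
        DataPoint,
        [
            {"timestamp": datetime.now(), "value": 1.5, "category": "sensor_a"},
            {"timestamp": datetime.now(), "value": 2.5, "category": "sensor_b"},
        ],
    )
    session.commit()

And for the pooling tip, a tuned engine configuration looks like this: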
from sqlalchemy.pool import QueuePool

# QueuePool is the default pool for most backends; the sizes here are illustrative
engine = create_engine(
    "postgresql://user:password@localhost:5432/dbname",
    poolclass=QueuePool,
    pool_size=5,      # persistent connections kept open
    max_overflow=10   # extra connections allowed under load
)
SQLAlchemy 2.0 provides a robust foundation for data science workflows in Python. By combining its powerful ORM capabilities with Deepnote's collaborative environment, you can create efficient, maintainable, and scalable data analysis pipelines. This guide covered essential concepts and best practices, but SQLAlchemy offers many more features to explore.
Remember to optimize your database queries, implement proper error handling, and follow Python best practices when building production-ready applications with SQLAlchemy.