Skip to content

Accounting service draft#

Requirements#

Functional requirements#

  • Users should be able to create and top-up their (funding) accounts.
  • Subscription management for users.
  • Funds management across v-lab/projects.
  • Resource usage collection, user cost calculation and billing.
  • Job termination when funds are exhausted.
  • Users should have access to cost breakdown at both virtual lab and project level.
  • Users should be able to see account balances, including reserved costs for currently running jobs.
  • Admins should be able to control service prices/margins.
  • Admins should be able to trace all transactions, see detailed and aggregated costs.

Non-functional requirements#

  • High availability with minimal downtime for critical services.
  • Double-spending/overspending prevention.
  • Frequent user balance updates.

High level architecture / data flow#

%%{init: {"flowchart": {"defaultRenderer": "elk"}} }%%

graph
  subgraph accountingSvc[Accounting service]
    sqs["AWS SQS (FIFO Queue)"]
    db[(AWS RDS)]
    accountingAPI[Accounting API]
    usageReceiverAPI[Event receiver API]
    paymentReceiverAPI[Payment receiver API]
    ledger[Ledger]
    pricing[Pricing]

    subgraph tasks[Tasks]
      jobCharger[Job Charger]
      eventProcessor[Event processor]
      watchdog[Watchdog]
      awsCostTracker[AWS cost tracker]
    end
  end

  subgraph svc[Services]
    subgraph longrunSvc[Longrun services]
      singleCellSim[Single cell simulations]
      meModel[ME-model simulations]
      smallCircuitSim[Small circuit simulations]
    end

    subgraph oneshotSvc[Oneshot services]
      ml[ML]
    end

    subgraph storageSvc[Storage report service]
      nexus[Nexus storage]
    end
  end

  virtualLabAPI[Virtual Lab API]
  costExplorerAPI[AWS Cost Explorer API]

  svc -- usage events ---> usageReceiverAPI
  usageReceiverAPI -- usage events ---> sqs

  longrunSvc -- pre-run cost reservation --> accountingAPI
  jobCharger -- job termination --> longrunSvc

  oneshotSvc -- pre-run cost reservation --> accountingAPI

  paymentReceiverAPI -- top-up events --> sqs

  sqs -- events --> eventProcessor
  eventProcessor -- usage reports --> db

  pricing --> db

  virtualLabAPI <--> accountingAPI

  ledger ---> db
  jobCharger ---> db
  jobCharger --- ledger
  jobCharger --- pricing

  watchdog --> db
  watchdog --- ledger

  costExplorerAPI -- AWS costs --> awsCostTracker
  awsCostTracker <--> db

  %% styles
  style svc fill:transparent

  style sqs fill:#FF9900
  style costExplorerAPI fill:#FF9900
  style db fill:#FF9900


  style paymentReceiverAPI fill:#87CEEB
  style usageReceiverAPI fill:#87CEEB
  style accountingAPI fill:#87CEEB
  style virtualLabAPI fill:#87CEEB

General idea#

The accounting service is meant to collect usage statistics from compute/storage services and to translate it into the user cost, which means:

  • Updating virtual-lab/project budgets.
  • Providing the user with detailed information about the usage and the cost of the resources spent.

Compute services#

Will handle two types of jobs by their billing model:

  • longrun, which are billed by running time and used resources (e.g. underlying EC2 instance type, number of nodes). Examples: model building, simulations, analysis.
  • oneshot, which are billed per execution with a fixed cost. Example: ML API calls.

These services will be responsible for:

  • Executing pre-run checks to estimate the cost and reserve user funds for each execution.
  • Providing usage statistics in a form of events which mark the start and the end of usage sessions as well as hearbeat signals.
  • Listening for job termination requests from the accounting service (longrun jobs only).

Storage report service#

Is responsible to periodically collect the usage statistics for shared S3 buckets per each virtual-lab/project and provide it to the accounting service.

Accounting service#

Ledger accessor#

It is one of the core components of the accounting service, it's role is to track, manage and report funds across various system and user (v-lab/project) accounts. Uses double entry accounting model.

The minimum list of accounts that are required for the service:

  • User related accounts:
    • Main account per each virtual lab. This is a target for top-ups.
    • Main account per each project.
    • Reservation account per each project.
  • System accounts:
    • Main platform account

Later on more accounts can be added, for example to track real AWS costs or other expenses.

SQL transactions must be used where applicable for concurrency control as well as to ensure data consistency and integrity.

Event processor#

It's responsible to collect from the AWS Simple Queue Service (SQS) and initiate processing for:

  • Usage events from:
    • The compute service.
    • The storage report service.
  • Top-up events from the payment provider.

Job charger#

A separate process that periodically charges users for their used storage and running jobs.

Cost component#

The component handles the following tasks:
1) Calculates user costs based on the usage reports by applying cost coefficient and fixed costs.
2) Handles cost estimation and reservation before any task is actually started (needed for double spent prevention).

AWS cost tracker#

Will provide insights about user billed costs and underlying platform AWS costs by mapping usage reports to data from AWS Cost Explorer API.
Can be used by platform administrators for analysis and decision making on service costs.

Watchdog#

It's responsible for:

  • detecting abnormal termination of longrun jobs, i.e. when some time has passed without receiving any running or finish event.
    In this case, the job should be marked as terminated in the db, and the job charger should charge the user for the partial cost.
  • detecting reserved job never started, i.e. when some time has passed without receiving any start event.
    In this case, the job should be marked as cancelled, and the job charger should release the reservation.

Business logic#

The process diagrams in this section are color-coded according the the job/event status as follows:

graph
  started[Started]
  running[Running]
  finished[Finished]

  style started fill:#FFFACD
  style running fill:#90EE90
  style finished fill:#ADD8E6

Longrun jobs#

Sequence diagram#

sequenceDiagram
  participant svc as Compute service

  box Accounting
    participant api as API
    participant cost as Cost
    participant ledger as Ledger
    participant queue as Queue
    participant jobCharger as Job Charger
    participant watchdog as Watchdog
  end

  svc ->> api: Request job reservation
  api ->> cost: Get estimated cost
  cost ->> api: Estimated cost

  api ->> ledger: Reserve estimated cost
  ledger ->> api: Reservation result

  break if not enough funds
    api ->> svc: Rejected
  end

  api ->> svc: Allowed

  Note over svc: Job is starting

  break if no events received
    watchdog ->> jobCharger: Cancel job
    jobCharger ->> ledger: Release reservation
    jobCharger ->> svc: Terminate
  end

  rect rgb(233, 247, 248)
    svc --) queue: Started evt

    loop
      par
        svc --) queue: Running evt

      and
        jobCharger ->> cost: Compute cost for last interval
        cost ->> jobCharger: Cost
        jobCharger ->> ledger: Charge reservation + project

        opt if no funds left
          jobCharger ->> svc: Terminate
        end
      end
    end

    break if no events received
      watchdog ->> jobCharger: Terminate job
      jobCharger ->> ledger: Return unspent reservation
      jobCharger ->> svc: Terminate
    end

    Note over svc: Job finishes

    svc --) queue: Finished evt
  end

  jobCharger ->> ledger: Return unspent reservation

Process diagrams#

Job execution request HTTP API calls#
graph LR
  start([Start])
  End([End])

  receive[/HTTP API:<br/>Receive job reservation request/]
  estimateCost[Cost service:<br/>Estimate job cost]
  checkAvailableFunds{Ledger:<br/>enough funds}
  createUsageReport[DB:<br/>Create reserved job]
  reserve[Ledger:<br/>reserve estimated cost]

  reject[Reject job exec]
  allow[Allow job exec]

  start --> receive
  receive --> estimateCost
  estimateCost --> checkAvailableFunds

  checkAvailableFunds --> |No| reject

  checkAvailableFunds --> |Yes| createUsageReport
  createUsageReport --> reserve
  reserve --> allow
  allow --> End

  reject --> End

  %% styles
  style estimateCost fill:#FFFACD
  style checkAvailableFunds fill:#FFFACD
  style createUsageReport fill:#FFFACD
  style reserve fill:#FFFACD
  style allow fill:#FFFACD
  style reject fill:#FFFACD
SQS events#
flowchart
  start([Start event processing loop])
  getEvent[Queue:<br/>Get event]

  checkStatus{Event status}

  initUsageReport[DB:<br/>Update reserved job with<br/>started_at and<br/>last_alive_at]
  updateFinishedAt[DB:<br/>Update finished_at<br/>and last_alive_at]
  updateLastAliveAt[DB:<br/>Update last_alive_at]
  deleteEvent[Delete event<br/>from the queue]

  start --> getEvent
  getEvent --> checkStatus

  checkStatus --> |"Started"| initUsageReport
  initUsageReport --> deleteEvent

  checkStatus --> |"Running"| updateLastAliveAt
  updateLastAliveAt --> deleteEvent

  checkStatus --> |Finished| updateFinishedAt
  updateFinishedAt --> deleteEvent

  deleteEvent --> getEvent

  %% styles

  %% Started
  style initUsageReport fill:#FFFACD

  %% Running
  style updateLastAliveAt fill:#90EE90

  %% Finished
  style updateFinishedAt fill:#ADD8E6
Periodic charging#
graph
  start([Start periodic charge processing])
  getJobToBeCharged[DB:<br/>Get longrun job to be charged]
  sleep[Sleep]

  compute_unfinished_uncharged[Cost service:<br/>compute fixed cost and<br/>cost for first running time]
  compute_unfinished_charged[Cost service:<br/>compute cost for running time<br/>since the last charge]
  compute_finished_uncharged[Cost service:<br/>compute fixed costs and cost<br/>for the full running time]
  compute_finished_charged[Cost service:<br/>compute cost for<br/>the final running time]
  compute_finished_overcharged[Cost service:<br/>compute cost for extra time<br/>previously overcharged]

  balance_unfinished{Available funds on<br/>reservation + proj?}
  charge_unfinished[Ledger:<br/>Charge<br/>reservation + proj]
  drain_unfinished[Ledger:<br/>Drain<br/>reservation + proj]
  terminateJob[Compute svc:<br/>Terminate job]

  balance_finished{Available funds on<br/>reservation + proj?}
  charge_finished[Ledger:<br/>Charge<br/>reservation + proj]
  drain_finished[Ledger:<br/>Drain<br/>reservation + proj]

  releaseUnspentReservation[Ledger:<br/>Release unspent<br/>reservation]

  refund_overcharged["Ledger:<br/>Refund extra time<br/>previously overcharged"]

  start --> getJobToBeCharged
  getJobToBeCharged --> |job not finished<br/>not charged| compute_unfinished_uncharged
  getJobToBeCharged --> |job not finished<br/>partially charged| compute_unfinished_charged
  getJobToBeCharged --> |job finished<br/>not charged| compute_finished_uncharged
  getJobToBeCharged --> |job finished<br/>partially charged| compute_finished_charged
  getJobToBeCharged --> |job finished<br/>overcharged<br/>*edge case*| compute_finished_overcharged

  compute_unfinished_uncharged --> balance_unfinished
  compute_unfinished_charged --> balance_unfinished
  compute_finished_uncharged --> balance_finished
  compute_finished_charged --> balance_finished
  compute_finished_overcharged --> refund_overcharged --> sleep

  balance_finished --> |< txn| drain_finished --> sleep
  balance_finished --> |>= txn| charge_finished --> releaseUnspentReservation --> sleep

  balance_unfinished --> |< txn| drain_unfinished --> terminateJob --> sleep
  balance_unfinished --> |>= txn| charge_unfinished --> sleep

  sleep --> getJobToBeCharged

  %% styles
  style compute_unfinished_uncharged fill:#90EE90
  style compute_unfinished_charged fill:#90EE90
  style balance_unfinished fill:#90EE90
  style drain_unfinished fill:#90EE90
  style charge_unfinished fill:#90EE90
  style terminateJob fill:#90EE90

  style compute_finished_uncharged fill:#ADD8E6
  style compute_finished_charged fill:#ADD8E6
  style balance_finished fill:#ADD8E6
  style drain_finished fill:#ADD8E6
  style charge_finished fill:#ADD8E6
  style releaseUnspentReservation fill:#ADD8E6

  style compute_finished_overcharged fill:#ADD8E6
  style refund_overcharged fill:#ADD8E6

Oneshot jobs#

Sequence diagram#

sequenceDiagram
  participant svc as Compute service

  box Accounting
    participant api as API
    participant cost as Cost
    participant ledger as Ledger
    participant queue as Queue
    participant jobCharger as Job Charger
    participant watchdog as Watchdog
  end

  svc ->> api: Request job reservation
  api ->> cost: Get estimated cost
  cost ->> api: Estimated cost
  api ->> ledger: Reserve estimated cost
  ledger ->> api: Reservation result

  break if not enough funds
    api ->> svc: Rejected
  end

  api ->> svc: Allowed

  rect rgb(233, 247, 248)
    Note over svc: Job starts

    break if no events received
      watchdog ->> jobCharger: Cancel job
      jobCharger ->> ledger: Release reservation
    end

    Note over svc: Job finishes
  end

  svc --) queue: Usage evt
  jobCharger ->> ledger: Charge reservation

Process diagrams#

Job execution request HTTP API calls#
graph LR
  start([Start])
  End([End])

  receive[/HTTP API:<br/>Receive job reservation request/]

  computeCost[Cost service:<br/>Compute job cost]
  checkAvailableFunds{Ledger:<br/>enough funds}
  createUsageReport[DB:<br/>Create reserved job]
  reserve[Ledger:<br/>reserve estimated cost]

  reject[Reject job exec]
  allow[Allow job exec]

  start --> receive
  receive --> computeCost
  computeCost --> checkAvailableFunds

  checkAvailableFunds --> |No| reject

  checkAvailableFunds --> |Yes| createUsageReport
  createUsageReport --> reserve
  reserve --> allow
  allow --> End

  reject --> End

  style computeCost fill:#FFFACD
  style checkAvailableFunds fill:#FFFACD
  style createUsageReport fill:#FFFACD
  style reserve fill:#FFFACD
  style reject fill:#FFFACD
  style allow fill:#FFFACD
SQS events#
graph LR
  %% event processing

  start([Start event processing loop])

  getEvent[Queue:<br/>Get event]
  closeUsageReport[DB:<br/>Update started_at,<br/>last_alive_at, finished_at]

  deleteEvent[Queue:
    Delete event]

  start --> getEvent
  getEvent --> closeUsageReport
  closeUsageReport --> deleteEvent
  deleteEvent --> getEvent

  %% styles
  %% Running
  style closeUsageReport fill:#ADD8E6
Periodic charging#

The oneshot jobs are charged by the job charger in a separate task so that the event processor doesn't need to handle money transactions.

graph LR
  start([Start periodic charge processing])
  getJobToBeCharged[DB:<br/>Get job to be charged]
  sleep[Sleep]

  computeTxnAmount[Cost service:<br/>compute txn amount]

  charge[Ledger:<br/>Charge<br/>reservation]

  start --> getJobToBeCharged
  getJobToBeCharged --> computeTxnAmount
  sleep --> getJobToBeCharged

  computeTxnAmount --> charge
  charge --> sleep

  %% styles
  style computeTxnAmount fill:#ADD8E6
  style charge fill:#ADD8E6

Storage#

Sequence diagram#

sequenceDiagram
  participant svc as Storage stats service

  box Accounting
    participant queue as Queue
    participant cost as Cost
    participant ledger as Ledger
    participant jobCharger as Job Charger
  end

  rect rgb(233, 247, 248)
    loop
      par
        svc --) queue: Usage evt
      and
        jobCharger ->> cost: Compute cost for last interval
        cost ->> jobCharger: Cost
        jobCharger ->> ledger: Charge main
      end
    end
  end

Process diagrams#

SQS Events#

graph LR
  %% event processing

  start([Start event processing loop])
  getEvent[Queue:<br/>Get event]
  createUsageReport[DB:<br/>Create new<br/>usage report]
  closePreviousUsageReport[DB:<br/>Close previous<br/>storage usage report]

  deleteEvent[Queue:<br/>Delete event]

  start --> getEvent
  getEvent --> closePreviousUsageReport
  closePreviousUsageReport --> createUsageReport
  createUsageReport --> deleteEvent
  deleteEvent --> getEvent

  %% styles
  %% Running
  style createUsageReport fill:#90EE90
  style closePreviousUsageReport fill:#90EE90
Periodic charging#
graph

  start([Start periodic charge processing])

  getStorageReportsToBeCharged["DB:<br/>Get storage reports<br/>not charged/partially charged"]

  computeTxnAmount[Cost service:<br/>compute runnning txn amount]
  chargeMain[Ledger:<br/>Charge project]

  computeFinalTxnCost[Cost service:<br/>Compute closing txn cost]
  chargeDeltaFromMain[Ledger:<br/>Create closing txn]

  sleep[Sleep]

  start --> getStorageReportsToBeCharged
  getStorageReportsToBeCharged --> |not finished| computeTxnAmount --> chargeMain --> sleep
  getStorageReportsToBeCharged --> |finished| computeFinalTxnCost --> chargeDeltaFromMain --> sleep
  sleep --> getStorageReportsToBeCharged

  %% styles
  %% Running
  style computeTxnAmount fill:#90EE90
  style chargeMain fill:#90EE90

  %% Finished
  style chargeDeltaFromMain fill:#ADD8E6
  style computeFinalTxnCost fill:#ADD8E6

SQS event format#


erDiagram
  StorageUsageEvent {
    string type "'storage'"
    uuid vlab_id
    uuid proj_id
    bigint size
    timestamp timestamp
  }

  LongrunJobUsageEvent {
    string type "'longrun'"
    string subtype "'single-cell-sim'"
    enum status "'started' | 'running' | 'finished'"
    uuid vlab_id
    uuid proj_id
    uuid job_id
    int instances
    string instance_type
    timestamp timestamp
  }

  OneshotJobUsageEvent {
    string type "'oneshot'"
    string subtype "'ml-query'"
    uuid vlab_id
    uuid proj_id
    uuid job_id
    int count
    timestamp timestamp
  }

  TopUpEvent["TopUpEvent TBD"] {
    enum type "'top-up'"
    uuid vlab_id
    uuid proj_id
    string amount "To check format from Stripe"
    timestamp timestamp
  }

Notes:
- All the payloads must be json dicts with keys and values formatted as strings.
- The timestamps must be represented as unix time in milliseconds.
- The separator and case used for type and subtype should be always the same for consistency:
- use lowercase names
- use - instead of _
- prefer single to plural names
- The list of valid subtypes needs to be decided yet, and it could evolve in the future.

DB schemas#

erDiagram

  ACCOUNT {
    uuid id PK
    enum account_type "'sys', 'vlab', 'proj', 'rsv'"
    uuid parent_id FK
    string name
    decimal balance
    bool enabled
  }

  JOURNAL {
    bigint id PK
    datetime transaction_datetime
    enum transaction_type "'top-up', 'assign', 'reserve', 'release', 'charge-oneshot', 'charge-longrun', 'charge-storage', 'refund'"
    uuid job_id FK
    int price_id FK
    dict properties "'reason', 'charge_period_start', 'charge_period_end'..."
  }

  LEDGER {
    bigint id PK
    uuid account_id FK
    bigint journal_id FK
    Decimal amount
  }

  JOB {
    uuid id PK
    uuid vlab_id FK
    uuid proj_id FK
    enum service_type "'storage', 'longrun', 'oneshot'"
    enum service_subtype
    int usage_value
    datetime reserved_at
    datetime started_at
    datetime last_alive_at
    datetime last_charged_at
    datetime finished_at
    datetime cancelled_at
    dict properties "['instances', 'instance_type'...]"
  }

  PRICE {
    int id PK
    enum service_type
    enum service_subtype
    datetime valid_from
    datetime valid_to
    decimal multiplier
    decimal fixed_cost
    uuid vlab_id FK
  }

  JOB ||--|{ JOURNAL : "Triggers"
  JOURNAL ||--|{ LEDGER : "Is detailed by"
  ACCOUNT ||--|{ LEDGER : "Has"
  ACCOUNT ||--o| ACCOUNT : "Has parent in"
  PRICE ||--|{ JOURNAL : "Is used by"
  ACCOUNT ||--|{ PRICE : "Is charged with"
  ACCOUNT ||--|{ JOB : "Started"