From Android platform fundamentals to pure-Kotlin fisheye inverse perspective mapping — a complete engineering walkthrough.
Welcome. This covers the full stack — from what Android is, through AAOS, to how a pure-geometry bird's-eye view is computed on embedded Android hardware.
Android is an open-source operating system based on the Linux kernel, developed by Google. It provides a complete software stack: kernel, middleware, runtime, and key applications. Released in 2008, it now powers over 3 billion active devices worldwide — phones, tablets, TVs, watches, cars, and embedded boards.
Open Source — AOSP is publicly available and forkable by any manufacturer
Multi-form factor — phones, tablets, TVs, watches, cars, embedded boards
APK packaging — one binary runs on any compatible Android device
Kotlin-first since 2019 — Google's preferred language for Android
Android 15 (API 35) is the current stable release. This project targets API 23 (Android 6.0) for broad compatibility including the Jetson Nano running LineageOS.
Android is not just a phone OS — it's a full software platform. The Jetson Nano running LineageOS proves this: same APK, same ART runtime, just different hardware underneath.
The most important file in any Android app. Declares the app to the OS: which activities exist, what permissions are needed, which hardware features are required or optional, and which intent filters make the app launchable. Without a valid manifest the OS will not install the APK.
An Activity represents one screen. It has a lifecycle managed by the OS: onCreate → onStart → onResume → onPause → onStop → onDestroy. Our app has one Activity that owns the entire UI and coordinates all computation.
Static files bundled inside the APK. Accessed via context.assets.open(). Our fisheye camera PNG images and the ground-truth BEV image live here as bev_images/{cam}/0.png.
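A minimal sketch of the asset load, following the path layout above (the helper name is illustrative):

```kotlin
import android.content.Context
import android.graphics.Bitmap
import android.graphics.BitmapFactory

// Opens one fisheye frame bundled in the APK and decodes it to a Bitmap.
fun loadCameraFrame(context: Context, cam: String): Bitmap =
    context.assets.open("bev_images/$cam/0.png").use { stream ->
        BitmapFactory.decodeStream(stream)
            ?: error("could not decode asset for camera $cam")
    }
```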
Compose replaces the old XML layout system. UI is declared as Kotlin functions annotated @Composable. State variables using mutableStateOf() automatically trigger UI redraws when changed. We use TV Material3 which adds focus-ring navigation for remote control.
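The pattern in one small sketch (composable name and status string are illustrative):

```kotlin
import androidx.compose.runtime.Composable
import androidx.compose.runtime.getValue
import androidx.compose.runtime.mutableStateOf
import androidx.compose.runtime.remember
import androidx.compose.runtime.setValue
import androidx.tv.material3.Text

@Composable
fun StatusLine() {
    // remember{} keeps the state across recompositions;
    // mutableStateOf() makes writes observable by Compose.
    var statusText by remember { mutableStateOf("Computing BEV…") }
    // Redraws automatically whenever statusText is reassigned;
    // no manual invalidation call is needed.
    Text(text = statusText)
}
```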
Gradle manages compilation, dependency downloading, and APK packaging. build.gradle.kts declares the SDK version, Compose and TV Material3 library versions, and signing config. The libs.versions.toml version catalog centralises all dependency versions.
Dispatchers.Default runs CPU-heavy work on a background thread pool. Dispatchers.Main runs UI updates on the main thread. withContext() switches between them safely. BEV computation (3–5 seconds) must run on Default — without this Android shows an ANR dialog after 5 seconds.
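A sketch of that pattern, with the heavy work and the UI update passed in as lambdas (both hypothetical stand-ins for the real BevProcessor call and Compose state write):

```kotlin
import android.graphics.Bitmap
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

suspend fun computeAndShow(
    computeBev: () -> Bitmap,   // 3–5 s of pure CPU work
    show: (Bitmap) -> Unit      // UI update, e.g. writing Compose state
) {
    // Runs on the background CPU pool; the main thread stays responsive.
    val bitmap = withContext(Dispatchers.Default) { computeBev() }
    // withContext resumes on the caller's dispatcher (Main), so this is safe.
    show(bitmap)
}
```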
Every Android app has this skeleton. The threading model is the most critical Android-specific engineering decision in this project.
Android Automotive OS is a version of Android built directly into the vehicle's infotainment system — not a phone projection like Android Auto. It runs permanently on the car's own ECU hardware whether or not any phone is connected. First production vehicle: Polestar 2 (2020).
Vehicle HAL (VHAL) — direct access to CAN bus, vehicle speed, gear, door status, hundreds of vehicle properties
CarService API — controls HVAC zones, audio focus, driver monitoring, instrument cluster
EVS (Exterior View System) — low-latency hardware camera access for surround-view, bypassing Camera2 API
Always-on architecture — boots with ignition, no user unlock, persistent services
Multi-display — native support for cluster + IVI + rear seat as separate zones
Functional Safety — ISO 26262 ASIL integration hooks
Our BEV app targets Android TV on Jetson Nano — a stepping stone. The same Kotlin code, Compose UI, and BevProcessor pipeline could run on an AAOS head unit with two main changes: replace asset-loaded PNGs with live EVS camera feeds, and replace TV Material3 navigation with CarUI library patterns. The BevProcessor geometry is fully platform-agnostic.
The most important AAOS difference for our project is EVS — instead of loading images from assets, we would get direct hardware camera frames at low latency. That single change transforms this prototype into a production system.
| Aspect | Standard Android | AAOS | Android TV (our target) |
|---|---|---|---|
| Hardware | Phone / Tablet | In-vehicle ECU / SoC | TV / Embedded board (Jetson Nano) |
| Input method | Touchscreen, gesture | Rotary knob, touch, voice | Remote control D-pad focus navigation |
| Boot trigger | Power button | Ignition / CAN bus event | Power supply connected |
| Camera access | Camera2 API | VHAL + EVS (Exterior View System) | Camera2 API or assets (our case) |
| UI framework | Compose / Views | Compose + CarUI library | Compose + TV Material3 |
| Multi-display | Limited | Native — cluster + IVI + rear | Single display |
| Safety standard | None | ISO 26262 ASIL hooks | None |
| App distribution | Play Store | Play Store for Cars (restricted) | Play Store for TV |
| Always-on | No | Yes — with ignition | Depends on power supply |
Bottom line: All three share the same Linux kernel, ART runtime, Kotlin compiler, and Gradle build system. Differences are in hardware HALs, UI navigation conventions, and safety requirements. Code written for Android TV can be ported to AAOS by changing only the camera input source and UI navigation patterns — the BevProcessor is already platform-agnostic.
The Camera access row is the most important. On Android TV we load static PNGs. On AAOS we use EVS. That single change transforms this into a production vehicle system.
A driver has four cameras around the vehicle but cannot perceive the surrounding environment as a unified top-down map. Each camera shows only its own distorted fisheye perspective. Stitching them into a coherent bird's-eye view requires solving the inverse perspective problem for wide-angle fisheye lenses — without a depth sensor and in real time on embedded hardware.
For every pixel in a 500×500 top-down output canvas, compute its real-world ground position in metres, project through each of the 4 fisheye camera models using the Mei projection equation, sample the colour, and blend all contributions weighted by viewing angle and distance. Runs entirely on Jetson Nano CPU in ~3 seconds with no GPU, no network, no AI.
Forward-looking Binocular Surround-view Semantic Estimation and Mapping. Unity-simulated parking lot scenes with 4 calibrated fisheye cameras and overhead ground-truth BEV images. Provides calibration files: intrinsics.yml + extrinsics.txt. Paper: arXiv:2303.03651.
The problem is fundamental geometry — how do you unwarp four fisheye perspectives into one overhead view? This is called AVM (Around View Monitor) in the automotive industry and is standard in premium vehicles.
Method 1: pure geometric IPM (our choice). Strengths:
No training data — only calibration files
Deterministic — identical input = identical output
Fully explainable — every pixel traceable to a formula
Runs on CPU — no GPU required
Tiny APK — no model weights
Method 1 weaknesses:
Oblique-angle blur on side lanes — physics limit
Flat-ground assumption — 3D objects warp
Car body projects into BEV image
Method 2: end-to-end neural network. Strengths:
Sharp side-lane reconstruction from learned priors
Can model 3D objects with learned depth cues
Can produce semantic labels — lane, vehicle, road
State-of-the-art visual quality on benchmarks
Method 2 weaknesses:
Requires GPU — TensorRT or ONNX Runtime
Needs thousands of labelled training images
Black box — hard to debug unexpected outputs
Large APK — model weights add 50–200 MB
Domain gap — may fail on unseen environments
Method 3: hybrid, IPM backbone plus ML refinement. Strengths:
IPM provides geometric accuracy as backbone
Smaller network — only corrects IPM artifacts
Interpretable: geometry debuggable, ML refines residuals
Best tradeoff between quality and complexity
Method 3 weaknesses:
More complex — two stages to maintain
Still needs some training data
Still needs GPU for refinement pass
We chose Method 1: the Jetson Nano CPU cannot run a neural network in real time, and we wanted a fully explainable system. Method 3 is the natural next step when GPU compute becomes available.
| Board | NVIDIA Jetson Nano |
| OS | LineageOS / Android TV |
| CPU | ARM Cortex-A57 quad-core @ 1.43 GHz |
| RAM | 4 GB LPDDR4 |
| GPU | 128-core Maxwell (not used by this app) |
| APK target SDK | API 23 (Android 6.0+) |
| Compute time | ~3 seconds (CPU only) |
Origin = centre of rear axle. Axes in the table below follow the dataset's Unity convention: X=right, Y=up, Z=forward. rz=180° cameras are physically mounted upside-down (rolled 180° about the optical axis) and are flipped 180° in code before processing.
| Camera | X (m) | Y (m) | Z (m) | rz handling |
|---|---|---|---|---|
| Front | 0.000 | 0.406 | 3.873 | flip 180° |
| Left | -1.024 | 0.800 | 2.053 | no flip |
| Rear | 0.132 | 0.744 | -1.001 | flip 180° |
| Right | 1.015 | 0.801 | 2.040 | no flip |
Measured through physical calibration using a checkerboard pattern. Specific to this camera model.
| ξ (xi) | 1.7634 | Mei fisheye parameter |
| fx | 331.0 px | Horizontal focal length |
| fy | 331.0 px | Vertical focal length |
| cx | 256.0 px | Principal point X |
| cy | 256.0 px | Principal point Y |
| Image size | 512 × 512 px | After scaling |
Each camera's horizontal facing direction. Used to build the 3×3 rotation matrix and compute angular blend weights.
| Camera | Yaw (°) | Yaw (rad) | Direction |
|---|---|---|---|
| Front | 0° | 0 | Straight ahead |
| Right | 90° | π/2 | Rightward |
| Rear | 180° | π | Backward |
| Left | 270° | 3π/2 | Leftward |
The rz=180° physical mounting is the most surprising hardware detail. The front and rear cameras are literally mounted upside-down. Without the bitmap rotation correction, their images appear inverted in the BEV output.
| Output size | 500 × 500 px |
| Scale | 0.02 m/px = 2 cm per pixel |
| Total coverage | 10 m × 10 m ground area |
| Origin pixel | (col=250, row=250) |
| Forward range (X) | −5 m to +5 m |
| Lateral range (Y) | −5 m to +5 m |
In the output image, UP = forward (ahead of the car). Row 0 is farthest forward; row 499 is farthest backward. This matches the natural driving perspective — what's ahead is at the top.
The coordinate system is the foundation of everything. The origin at (250,250) represents the real-world rear axle — the physical reference point for all camera extrinsics.
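The grid constants above translate into a two-line mapping. A sketch, assuming right = larger column for lateral position (the slides only fix the forward/row convention):

```kotlin
const val SCALE = 0.02f   // metres per pixel
const val ORIGIN = 250    // origin pixel: col = row = 250 (rear axle)

// Ground position (metres) of BEV pixel (col, row).
fun pixelToWorld(col: Int, row: Int): Pair<Float, Float> {
    val forward = (ORIGIN - row) * SCALE   // row 0   -> +5 m ahead
    val lateral = (col - ORIGIN) * SCALE   // col 499 -> ~+5 m right (assumed sign)
    return forward to lateral
}
```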
Forward mapping (rejected): for each camera pixel (u,v), find where its ray lands on the ground. Problem: many fisheye pixels point at the sky, other vehicles, or buildings — not the ground at all. Resolving where each ray actually terminates requires depth estimation or LiDAR. Far more complex.
Inverse mapping (our approach): for each output BEV pixel, compute its real-world ground position, then ask each camera: which of your pixels shows this ground location? Guaranteed to only sample ground-level content. No depth sensor needed — we assume a flat ground at height zero, which is exact for the road surface.
Every BEV pixel maps to ground height zero. Exact for road surface, parking lines, markings. Fails for 3D objects — other cars, walls, pedestrians — which appear stretched because their actual height is non-zero. This is the fundamental physical limitation of all camera-only IPM systems.
The inversion is the entire trick. We never ask "where does this camera pixel land?" We ask "which camera pixel shows this ground location?" The flat ground assumption is the price for not needing depth sensors.
Converts a direction vector from world coordinates into the camera's own local coordinate frame. After this, xc is rightward in the camera frame, yc is upward, and zc is depth along the optical axis. If zc is negative the ground point is behind the camera — skip it.
Row 1 (X axis) — camera right direction, perpendicular to optical axis in the horizontal plane
Row 2 (Y axis) — camera up direction, computed as cross product of Z×X
Row 3 (Z axis) — optical axis direction the camera faces = (cos(yaw), sin(yaw), 0)
cos(0°)=1, sin(0°)=0. Z-axis=(1,0,0) points straight forward. X-axis=(0,1,0) points right. For a ground point 5m ahead: after transform xc=0 (no lateral), yc=0 (no vertical), zc=5 (directly in front at positive depth).
The matrix is pre-built once per camera before the main loop. Building it 250,000 times inside the loop would be pure waste. The 9 elements are extracted to local variables before the pixel loop for CPU register efficiency.
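A sketch of the transform in the yaw frame these slides describe (world X = forward, Y = right, Z = up; converting the Unity-convention extrinsics into this frame is assumed to happen at load time). Here (dx, dy, dz) is the ground point with the camera position already subtracted:

```kotlin
import kotlin.math.cos
import kotlin.math.sin

// Applies the yaw-only world->camera rotation described above.
// Row 3 (optical axis)  = (cos yaw, sin yaw, 0)
// Row 1 (camera right)  = (-sin yaw, cos yaw, 0)
// Row 2 (camera up)     = Z × X = (0, 0, 1)
fun worldToCamera(yawRad: Double, dx: Double, dy: Double, dz: Double): DoubleArray {
    val xc = -sin(yawRad) * dx + cos(yawRad) * dy  // rightward in camera frame
    val yc = dz                                    // upward in camera frame
    val zc = cos(yawRad) * dx + sin(yawRad) * dy   // depth along optical axis
    return doubleArrayOf(xc, yc, zc)               // caller skips the point if zc <= 0
}
```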
Standard pinhole: u = fx×(xc/zc)+cx. Fails above ~160° FOV because zc approaches zero for near-horizontal rays (division by near-zero → extreme distortion) and becomes negative for rays beyond 90° (point appears behind camera even though the lens physically sees it). Our fisheye cameras exceed 180° FOV.
Project the 3D point onto a unit sphere. Then shift the projection centre inward along the optical axis by ξ (xi). This single parameter controls how fisheye the projection is. ξ=0 recovers standard pinhole exactly. As ξ increases, wider angles compress into the image, allowing 200°+ FOV.
| ξ = 0.0 | Standard pinhole — fails above ~160° FOV |
| ξ = 0.5 | Mild wide-angle lens |
| ξ = 1.0 | Fisheye ~180° field of view |
| ξ = 1.76 | Our cameras — extreme fisheye, ~200° FOV |
Dividing xc by r3d places the 3D point on the surface of a unit sphere. The result is the x-direction cosine — always between −1 and +1. Dividing by mzXi applies the Mei perspective division. The larger ξ is, the more mzXi stays away from zero even for wide angles, which is what enables the fisheye to see beyond 90°.
When ξ=0: mzXi = zc/r3d = cosine of the angle from the optical axis. This recovers pinhole exactly. The ξ shift is the sole mathematical difference between fisheye and pinhole projection.
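Putting the two steps together, a sketch with this camera's intrinsics as defaults (the vertical sign is an assumption; it depends on whether yc points up or down in the camera frame):

```kotlin
import kotlin.math.sqrt

// Mei projection: unit-sphere normalisation, xi shift, perspective division.
fun meiProject(
    xc: Double, yc: Double, zc: Double,
    xi: Double = 1.7634,
    fx: Double = 331.0, fy: Double = 331.0,
    cx: Double = 256.0, cy: Double = 256.0
): Pair<Double, Double> {
    val r3d = sqrt(xc * xc + yc * yc + zc * zc)  // distance to the 3D point
    val mzXi = zc / r3d + xi                     // = cos(angle) + xi; xi = 0 recovers pinhole
    val u = fx * (xc / r3d) / mzXi + cx
    val v = fy * (-yc / r3d) / mzXi + cy         // minus: image v grows downward (assumed)
    return u to v
}
```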
Multiple cameras can see the same ground point. All contributions are blended. Each camera's weight is proportional to how directly it looks at the point and inversely proportional to the distance from the car origin. This gives smooth transitions between camera zones with no hard seams.
The projected coordinate (u,v) is almost never a whole number. Without interpolation the output has a coarse blocky texture. Bilinear blends the four integer-coordinate neighbours weighted by how close the point is to each corner.
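A sketch of the sampling step (red channel only; green and blue are identical; assumes (u, v) lies at least one pixel inside the image):

```kotlin
import android.graphics.Bitmap
import android.graphics.Color

fun sampleRedBilinear(bmp: Bitmap, u: Float, v: Float): Float {
    val x0 = u.toInt(); val y0 = v.toInt()   // top-left integer neighbour
    val fx = u - x0;    val fy = v - y0      // fractional offsets in [0, 1)
    val r00 = Color.red(bmp.getPixel(x0,     y0    )).toFloat()
    val r10 = Color.red(bmp.getPixel(x0 + 1, y0    )).toFloat()
    val r01 = Color.red(bmp.getPixel(x0,     y0 + 1)).toFloat()
    val r11 = Color.red(bmp.getPixel(x0 + 1, y0 + 1)).toFloat()
    // Blend the four neighbours, weighted by closeness to each corner.
    return (r00 * (1 - fx) + r10 * fx) * (1 - fy) +
           (r01 * (1 - fx) + r11 * fx) * fy
}
```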
The cosA ≤ 0 check runs BEFORE the matrix multiply and Mei projection — eliminating ~50% of iterations before any expensive maths. This single early exit is the biggest single performance optimisation in the whole pipeline.
The exponent in the cosA^1.5 weighting is a tuneable parameter. A lower exponent gives smoother blending but a blurrier result; a higher one gives sharper camera-zone boundaries but more visible transition lines. 1.5 was found empirically to give the best result on this dataset.
For each output pixel, all 4 cameras may contribute. Their colour contributions are summed with individual weights into parallel float arrays. After the complete loop, dividing by total weight gives the final weighted-average colour. Two-pass approach guarantees mathematically correct blend regardless of camera count.
We cannot compute the final average during the inner loop because when processing the first camera for a pixel we don't yet know the total weight from all 4 cameras. Accumulate first, normalise second. Same principle used in HDR tone mapping, alpha compositing, and radiosity rendering.
Four parallel FloatArrays — bevR, bevG, bevB, totalW — each 250,000 elements (500×500). Flat arrays have sequential memory layout with excellent CPU cache locality: reading the next element is one pointer increment. A 2D array adds a level of indirection, and a Map adds hashing and boxing, on every one of the 1,000,000 inner-loop iterations.
Pixels directly under the car body are too close for any camera to see the ground through the chassis. Their totalW stays near zero. They remain black in the output bitmap and are covered by the green car box overlay drawn on top.
The two-pass weighted mean is standard in signal processing: accumulate numerator (weighted sum) and denominator (total weight) separately, then divide once. coerceIn(0,255) clamps floating point rounding errors.
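The accumulate-then-normalise structure in sketch form (the weight shown combines cosA^1.5 and inverse distance as the blending slides describe; array names follow the slide above):

```kotlin
const val N = 500 * 500
val bevR = FloatArray(N); val bevG = FloatArray(N)
val bevB = FloatArray(N); val totalW = FloatArray(N)

// Pass 1, inside the per-camera pixel loop: accumulate weighted colour.
// weight ~ cosA.pow(1.5f) / distanceFromOrigin, computed by the caller.
fun accumulate(idx: Int, r: Float, g: Float, b: Float, weight: Float) {
    bevR[idx] += r * weight
    bevG[idx] += g * weight
    bevB[idx] += b * weight
    totalW[idx] += weight
}

// Pass 2, once after all four cameras: normalise to the weighted mean.
fun finalRed(idx: Int): Int =
    if (totalW[idx] > 1e-6f)
        (bevR[idx] / totalW[idx]).toInt().coerceIn(0, 255)
    else 0   // no camera sees this pixel (under the car): stays black
```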
Converts world metres back to BEV pixel position — the exact inverse of the main loop formula.
| Car box | ±2m forward/rear, ±1m lateral + CAR_Y_OFFSET. Dark fill hides projection artifacts. |
| Arrow | Shaft from rear(−1.2m) to front(+1.5m). Tip UP = forward. Wings spread left+right downward = ∧ arrowhead. |
| 2m ring | Cyan, radius = 2/SCALE = 100px, centred on (250,250). |
| 4m ring | Blue, radius = 4/SCALE = 200px, centred on (250,250). |
The rear camera is mounted at X=+0.132m (slightly right of centre). Combined with the fisheye projection geometry, the car body appears shifted rightward in the rendered BEV texture. CAR_Y_OFFSET shifts the overlay box to visually match. It is tuned by visual comparison with an overhead reference image and would need re-tuning for a different vehicle.
The original code had both arrowhead wings going to bx−10 — both going LEFT, making a sideways ⟩ shape. Fix: left wing = (bx−10, by+10), right wing = (bx+10, by+10). Both go downward (by+10 = larger row = backward in image), spreading left and right to form a correct ∧ arrowhead pointing toward the front of the car.
toPx(0f, 0f) = column 250, row 250 — the pixel coordinate of the real-world origin (rear axle). Rings represent physical 2m and 4m radius circles around the vehicle's rear axle, giving the driver a calibrated spatial reference in the display.
toPx() is the exact algebraic inverse of the main loop formula. Forward direction maps to smaller row numbers because row 0 is the top of the image = farthest forward. This is why the arrowhead at the front appears at the top of the green box.
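toPx() as implied by the grid constants, using the same assumed lateral sign as the earlier pixelToWorld sketch:

```kotlin
// World metres -> BEV pixel: exact algebraic inverse of pixelToWorld().
fun toPx(forward: Float, lateral: Float): Pair<Int, Int> {
    val col = (250 + lateral / 0.02f).toInt()   // right -> larger column (assumed)
    val row = (250 - forward / 0.02f).toInt()   // forward -> smaller row (top of image)
    return col to row
}
// toPx(0f, 0f) == (250, 250), the rear-axle origin;
// a 2 m ring spans 2 / 0.02 = 100 px, matching the overlay table.
```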
Declared with mutableStateOf() inside remember{}. Any change automatically triggers a Compose recomposition — affected UI parts redraw themselves with no manual invalidation.
| bevBitmap | Computed BEV result bitmap or null |
| gtBitmap | Cropped ground truth bitmap or null |
| showingGt | Toggle — which image to display |
| isLoading | Controls progress bar and button disabled state |
| statusText | Status line shown under the title |
| timingText | Elapsed compute time in milliseconds |
Column — vertical layout fills screen with dark background (#111111)
Box with weight(1f) — image display expands to fill all remaining vertical space between title and buttons
Image(ContentScale.Fit) — scales the 500×500 BEV bitmap to fill the box preserving its square aspect ratio
BevButton (TV Material3) — focusedContainerColor=green makes the selected button clearly visible under D-pad navigation
LinearProgressIndicator — shown during BEV computation to indicate background activity
LaunchedEffect(Unit) — auto-triggers computeBev() on first launch so BEV appears immediately
Writes PNG to /sdcard/Documents/BevSnapshots/ using FileOutputStream + bitmap.compress(PNG, quality=100). Creates directory if absent. Lossless PNG preserves all pixel data. Full file path shown in statusText after successful save.
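A sketch of the save path (filename illustrative; note that the quality argument is ignored for lossless PNG):

```kotlin
import android.graphics.Bitmap
import java.io.File
import java.io.FileOutputStream

fun saveSnapshot(bitmap: Bitmap): File {
    val dir = File("/sdcard/Documents/BevSnapshots").apply { mkdirs() }
    val file = File(dir, "bev_${System.currentTimeMillis()}.png")
    FileOutputStream(file).use { out ->
        bitmap.compress(Bitmap.CompressFormat.PNG, 100, out)
    }
    return file   // full path is then shown in statusText
}
```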
Threading is the most critical Android-specific engineering decision. Without withContext(Dispatchers.Default), the 3–5 second computation blocks the main thread and Android shows an ANR dialog after 5 seconds.
| Compute time | ~3 seconds (Jetson Nano CPU only) |
| Total iterations | 500×500×4 = 1,000,000 |
| Working memory | 4 arrays × 250K × 4B ≈ 4 MB |
| Output bitmap | 500×500 ARGB ≈ 1 MB |
| Input bitmaps | 4 × 512×512 ARGB ≈ 4 MB |
| Total peak memory | ~10 MB — well within Jetson RAM |
Road texture directly ahead and behind — geometrically accurate
Lane markings in front and rear zones — sharp and correctly positioned
Distance rings calibrated to real metres within 2cm precision
Camera seams blend smoothly with no visible hard transition lines
Side lane blur: Lanes 4–6m lateral are seen at ~10° above horizontal. 1px fisheye error = ~50cm ground error. Not fixable without AI or depth sensors — this is a physics constraint.
3D object warping: Other cars, walls, pedestrians appear stretched because IPM assumes a flat ground at height zero. Their actual height above ground is projected as if they were flat.
Car body artifact: Side cameras project the ego vehicle's own body into the BEV. Mitigated by the dark car box overlay. Fundamental to all camera-only IPM.
Multi-thread with Kotlin coroutines — parallel per-camera passes → target <1 second
Real-time camera feed via Android Camera2 API or AAOS EVS
F2BEV neural network refinement pass for side-lane artifact correction
Port to AAOS with EVS camera access for in-vehicle production deployment
The side lane blur is not a bug — it's physics. At 6m lateral, the camera sees ground at less than 10° elevation. One pixel maps to ~50cm of ground. No software tuning fixes this without depth information.
No server. No neural network at runtime. Just camera calibration, rotation matrices, and the Mei fisheye model — running on commodity Android hardware.
Hold this slide for 5 seconds. The GitHub link is in the video description. The docs/ folder contains the full algorithm flowchart, animated walkthrough, and this slide deck.