**********
Disclaimer: We are using BETA Features!!!
**********
This article assumes you have some basic understanding of
- using AVFoundation for camera processing, and
- using Vision for detection in still images.
If you need a refresher on those, feel free to check out some of my previous articles:
If you are ready, let’s see how we can use the Vision framework to detect objects in live capture.
I will build my example around text detection, but you can easily change it (or add more to it) by using other VisionRequests such as DetectBarcodesRequest, since all of them are performed in the same way!
You can find the entire demo app on GitHub. Please feel free to grab it and play around!
Set Up
As always, when we need to access the user’s camera, we need to add the NSCameraUsageDescription key to our Info.plist.
Camera For Preview
Before we move on to the main Vision part, let’s first set up our camera for previewing.
import SwiftUI
import AVFoundation

class CameraManager: NSObject {
    private let captureSession = AVCaptureSession()
    private var isCaptureSessionConfigured = false
    private var deviceInput: AVCaptureDeviceInput?

    // for preview
    private var videoOutput: AVCaptureVideoDataOutput?
    private var sessionQueue: DispatchQueue!

    // for preview device output
    var isPreviewPaused = false
    private var addToPreviewStream: ((CIImage) -> Void)?

    lazy var previewStream: AsyncStream<CIImage> = {
        AsyncStream { continuation in
            addToPreviewStream = { ciImage in
                if !self.isPreviewPaused {
                    continuation.yield(ciImage)
                }
            }
        }
    }()

    override init() {
        super.init()
        // The value of this property is an AVCaptureSessionPreset indicating the current session preset in use by the receiver. The sessionPreset property may be set while the receiver is running.
        captureSession.sessionPreset = .low
        sessionQueue = DispatchQueue(label: "session queue")
    }

    func start() async {
        let authorized = await checkAuthorization()
        guard authorized else {
            print("Camera access was not authorized.")
            return
        }

        if isCaptureSessionConfigured {
            if !captureSession.isRunning {
                sessionQueue.async { [self] in
                    self.captureSession.startRunning()
                }
            }
            return
        }

        sessionQueue.async { [self] in
            self.configureCaptureSession { success in
                guard success else { return }
                self.captureSession.startRunning()
            }
        }
    }

    func stop() {
        guard isCaptureSessionConfigured else { return }
        if captureSession.isRunning {
            sessionQueue.async {
                self.captureSession.stopRunning()
            }
        }
    }

    private func configureCaptureSession(completionHandler: (_ success: Bool) -> Void) {
        var success = false
        self.captureSession.beginConfiguration()
        defer {
            self.captureSession.commitConfiguration()
            completionHandler(success)
        }

        guard
            let captureDevice = AVCaptureDevice.DiscoverySession(deviceTypes: [.builtInWideAngleCamera], mediaType: .video, position: .back).devices.first,
            let deviceInput = try? AVCaptureDeviceInput(device: captureDevice)
        else {
            print("Failed to obtain video input.")
            return
        }

        captureSession.sessionPreset = AVCaptureSession.Preset.vga640x480

        let videoOutput = AVCaptureVideoDataOutput()
        videoOutput.alwaysDiscardsLateVideoFrames = true
        videoOutput.videoSettings = [kCVPixelBufferPixelFormatTypeKey as String: Int(kCVPixelFormatType_420YpCbCr8BiPlanarFullRange)]
        videoOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "VideoDataOutputQueue"))
        guard captureSession.canAddOutput(videoOutput) else {
            print("Unable to add video output to capture session.")
            return
        }

        captureSession.addInput(deviceInput)
        captureSession.addOutput(videoOutput)

        // Configure the connection only after the output has been added to the session;
        // connection(with:) returns nil before that point.
        if let videoOutputConnection = videoOutput.connection(with: .video) {
            if videoOutputConnection.isVideoMirroringSupported {
                videoOutputConnection.isVideoMirrored = false
            }
        }

        self.deviceInput = deviceInput
        self.videoOutput = videoOutput

        isCaptureSessionConfigured = true
        success = true
    }

    private func checkAuthorization() async -> Bool {
        switch AVCaptureDevice.authorizationStatus(for: .video) {
        case .authorized:
            print("Camera access authorized.")
            return true
        case .notDetermined:
            print("Camera access not determined.")
            sessionQueue.suspend()
            let status = await AVCaptureDevice.requestAccess(for: .video)
            sessionQueue.resume()
            return status
        case .denied:
            print("Camera access denied.")
            return false
        case .restricted:
            print("Camera library access restricted.")
            return false
        default:
            return false
        }
    }
}

extension CameraManager: AVCaptureVideoDataOutputSampleBufferDelegate {
    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        guard let pixelBuffer = sampleBuffer.imageBuffer else { return }
        connection.videoRotationAngle = RotationAngle.portrait.rawValue
        addToPreviewStream?(CIImage(cvPixelBuffer: pixelBuffer))
    }

    func captureOutput(_ output: AVCaptureOutput, didDrop sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        // print("frame dropped")
    }
}

private enum RotationAngle: CGFloat {
    case portrait = 90
    case portraitUpsideDown = 270
    case landscapeRight = 180
    case landscapeLeft = 0
}

extension CIImage {
    var image: Image? {
        let ciContext = CIContext()
        guard let cgImage = ciContext.createCGImage(self, from: self.extent) else { return nil }
        return Image(decorative: cgImage, scale: 1, orientation: .up)
    }
}
All we need to know from the code above is the following (a quick usage sketch follows the list).
- The camera preview will be available through previewStream: AsyncStream<CIImage>.
- To start the camera, call start().
- To stop the camera, call stop().
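For example, pulling frames off the stream looks roughly like this. This is a minimal sketch, not part of the demo app (the real consumer is the DetectionModel we will build later), and consumePreview is just a hypothetical helper name:
// A minimal usage sketch; the real consumer is the DetectionModel built later.
func consumePreview(from camera: CameraManager) async {
    await camera.start()                        // requests permission (if needed) and starts the session
    for await ciImage in camera.previewStream { // each element is the latest camera frame as a CIImage
        // update your UI or hand `ciImage` to Vision here
        print("got a frame: \(ciImage.extent)")
    }
}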
If you do want a little more insight, here are a couple of notes I would like to make!
One!
Make sure to specify the pixel format for videoOutput!
videoOutput.videoSettings = [kCVPixelBufferPixelFormatTypeKey as String: Int(kCVPixelFormatType_420YpCbCr8BiPlanarFullRange)]
Two!
Also, I have set videoOutput.alwaysDiscardsLateVideoFrames = true explicitly. It defaults to true, but I do want to point it out!
Vision requests are really expensive and somewhat time-consuming. The camera will stop working if the buffer queue overflows the available memory. We want to allow AVFoundation to drop frames when necessary.
If you have taken a look at my previous SwiftUI: Camera App with AVFoundation, you might realize that what we have here is a lot simpler. That’s because we don’t need to switch between capture devices, nor deal with video recording or photo taking!
Vision For Text Detection
Let me first share the code with you and then take a more detailed look at what we have here!
import SwiftUI
import Vision

class VisionManager {
    private var requests: [any VisionRequest] = []
    private var isProcessing: Bool = false

    private var addToObservationStream: (([any VisionObservation]) -> Void)?

    lazy var observationStream: AsyncStream<[any VisionObservation]> = {
        AsyncStream { continuation in
            addToObservationStream = { observations in
                continuation.yield(observations)
            }
        }
    }()

    init() {
        var request = RecognizeTextRequest()
        request.recognitionLanguages = [Locale.Language.init(identifier: "en-US"), Locale.Language.init(identifier: "ja-JP")]
        self.requests = [request]
    }

    // MARK: For Live Capture Processing
    func processLiveDetection(_ ciImage: CIImage) {
        if isProcessing { return }
        isProcessing = true
        Task {
            let imageRequestHandler = ImageRequestHandler(ciImage)
            let results = imageRequestHandler.performAll(self.requests)
            await handleVisionResults(results: results)
            isProcessing = false
        }
    }

    // MARK: Process Results and Observations
    @MainActor
    private func handleVisionResults(results: some AsyncSequence<VisionResult, Never>) async {
        var newObservations: [any VisionObservation] = []
        for await result in results {
            switch result {
            case .recognizeText(_, let observations):
                newObservations.append(contentsOf: observations)
            default:
                return
            }
        }
        addToObservationStream?(newObservations)
    }

    @MainActor
    func processObservation(_ observation: any VisionObservation, for imageSize: CGSize) -> (text: String, confidence: Float, size: CGSize, position: CGPoint) {
        switch observation {
        case is RecognizedTextObservation:
            return processTextObservation(observation as! RecognizedTextObservation, for: imageSize)
        default:
            return ("", .zero, .zero, .zero)
        }
    }

    private func processTextObservation(_ observation: RecognizedTextObservation, for imageSize: CGSize) -> (text: String, confidence: Float, size: CGSize, position: CGPoint) {
        let recognizedText = observation.topCandidates(1).first?.string ?? ""
        let confidence = observation.topCandidates(1).first?.confidence ?? 0.0
        let boundingBox = observation.boundingBox

        let converted = boundingBox.toImageCoordinates(imageSize, origin: .upperLeft)
        let position = CGPoint(x: converted.midX, y: converted.midY)

        return (recognizedText, confidence, converted.size, position)
    }
}
Similar to our CameraManager, the observations (i.e., the detection results) will be made available through var observationStream: AsyncStream<[any VisionObservation]>. We will take a more detailed look at how we use these AsyncStreams in the next section.
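Just to show the shape of that stream before we get there, a consumer looks roughly like this (a rough sketch with a hypothetical helper name; the real version is handleVisionObservations in the DetectionModel below):
// Rough sketch of pulling observations off the stream; see DetectionModel below for the real version.
func consumeObservations(from vision: VisionManager) async {
    for await observations in vision.observationStream {
        // One array per processed frame, containing every observation Vision returned for it.
        print("Received \(observations.count) observation(s)")
    }
}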
To process an image, we will call processLiveDetection.
func processLiveDetection(_ ciImage: CIImage) {
    if isProcessing { return }
    isProcessing = true
    Task {
        let imageRequestHandler = ImageRequestHandler(ciImage)
        let results = imageRequestHandler.performAll(self.requests)
        await handleVisionResults(results: results)
        isProcessing = false
    }
}
Two important points I would like to call out here!
- Since I only want to process ONE vision request at a time, I have added an isProcessing flag to check. The reason for that, as I mentioned above, is that vision requests are expensive. I don’t want my app to freeze, and I don’t think it is necessary to try to process every frame!
- Don’t make this function async, as that will block the camera preview. It might not be obvious at this point, but I promise it will become clearer in the next section, where we put together our CameraManager and VisionManager.
Since we are only interested in detecting text, when processing VisionResults and VisionObservations, we simply return for all other cases.
If you have other request types in addition to text, for example barcodes, simply add the corresponding cases to handleVisionResults and processObservation, as sketched below.
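Here is a rough sketch of what that could look like inside VisionManager. Note that the .detectBarcodes case name and the BarcodeObservation type are my reading of the beta API (they mirror .recognizeText and RecognizedTextObservation); double-check them against the SDK version you are building with.
// Sketch only: VisionManager with a barcode request added alongside the text request.
// The `.detectBarcodes` case and `BarcodeObservation` names are assumptions based on the beta API.
init() {
    var textRequest = RecognizeTextRequest()
    textRequest.recognitionLanguages = [Locale.Language.init(identifier: "en-US"), Locale.Language.init(identifier: "ja-JP")]
    self.requests = [textRequest, DetectBarcodesRequest()]
}

@MainActor
private func handleVisionResults(results: some AsyncSequence<VisionResult, Never>) async {
    var newObservations: [any VisionObservation] = []
    for await result in results {
        switch result {
        case .recognizeText(_, let observations):
            newObservations.append(contentsOf: observations)
        case .detectBarcodes(_, let observations):
            // BarcodeObservation also conforms to VisionObservation, so it can share the same array.
            newObservations.append(contentsOf: observations)
        default:
            // With more than one request registered, skip unhandled results instead of returning,
            // so one unknown result type does not drop the whole batch.
            continue
        }
    }
    addToObservationStream?(newObservations)
}
You would also add a matching branch to processObservation that reads the barcode’s bounding box and payload string, the same way processTextObservation does for text.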
We will be using the processObservation function later when adding bounding rectangles and text labels to our preview image.
Detection Model
This is the ObservableObject used by our view; it keeps a reference to both the CameraManager and the VisionManager.
import SwiftUI
import Vision

class DetectionModel: ObservableObject {
    let camera = CameraManager()
    let vision = VisionManager()

    @Published var observations: [any VisionObservation] = []
    @Published var previewImage: CIImage?

    @Published var isDetecting: Bool = false {
        didSet {
            if isDetecting {
                if framePerSecond > 0 {
                    self.timer = Timer.scheduledTimer(timeInterval: 1.0/Double(framePerSecond), target: self, selector: #selector(timerFired), userInfo: nil, repeats: true)
                }
            } else {
                self.observations = []
                timer?.invalidate()
            }
        }
    }
    var framePerSecond = 0 {
        didSet {
            // Clamp to 0...maxFramePerSecond.
            // Assignments inside didSet stick and don't re-trigger the observer;
            // a store inside willSet would be overwritten by the incoming new value.
            if framePerSecond < 0 {
                framePerSecond = 0
            } else if framePerSecond > maxFramePerSecond {
                framePerSecond = maxFramePerSecond
            }
        }
    }
    let maxFramePerSecond = 30
    private var timer: Timer? = nil

    init() {
        Task {
            await handleCameraPreviews()
        }
        Task {
            await handleVisionObservations()
        }
    }

    private func handleCameraPreviews() async {
        for await ciImage in camera.previewStream {
            Task { @MainActor in
                previewImage = ciImage
                // Continuous processing
                if isDetecting, framePerSecond <= 0 {
                    vision.processLiveDetection(ciImage)
                }
            }
        }
    }

    private func handleVisionObservations() async {
        for await observations in vision.observationStream {
            Task { @MainActor in
                if isDetecting {
                    self.observations = observations
                }
            }
        }
    }

    // Processing based on FPS specified
    @objc private func timerFired() {
        guard isDetecting, let ciImage = previewImage else {
            return
        }
        vision.processLiveDetection(ciImage)
    }
}
First of all, we have handleCameraPreviews and handleVisionObservations to pull from the AsyncStreams in each manager class and update the corresponding @Published variables.
I have added a little framePerSecond feature so that we can choose between processing the preview continuously or only when the timer fires. I have set the max to 30 here, but change it as you wish!
If we are detecting (i.e., isDetecting == true) and processing continuously with framePerSecond set to 0, we call vision.processLiveDetection in handleCameraPreviews. Otherwise, it is called from timerFired.
A tiny side note here.
If you try to do this clamping inside willSet instead, you will run into the Attempting to store to property within its own willSet warning. Using explicit self (self.framePerSecond) silences the warning, but the value you store there still gets overwritten by the incoming newValue as soon as willSet returns, so the clamp never takes effect. That is why the clamping lives in didSet above.
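If you want to see that behavior in isolation, here is a tiny playground-style sketch (the Counter class is hypothetical, purely for illustration, and not part of the demo app):
// Playground-style illustration of why the clamp lives in didSet.
class Counter {
    var clampedInWillSet = 0 {
        willSet {
            // Storing here (with or without explicit self) has no lasting effect:
            // the value is overwritten by `newValue` once willSet returns.
            // Without explicit self you also get the "Attempting to store to property
            // within its own willSet" warning.
            if newValue > 10 { self.clampedInWillSet = 10 }
        }
    }

    var clampedInDidSet = 0 {
        didSet {
            // Assignments inside didSet stick and do not re-trigger the observer.
            if clampedInDidSet > 10 { clampedInDidSet = 10 }
        }
    }
}

let counter = Counter()
counter.clampedInWillSet = 99
counter.clampedInDidSet = 99
print(counter.clampedInWillSet, counter.clampedInDidSet) // 99 10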
View
Yes, we are finally here!
Ready to put together our View for live capture and detection!
import SwiftUI

struct LiveDetection: View {
    @StateObject private var detectionModel = DetectionModel()
    @State private var sliderValue: Double = .zero
    @State private var imageSize: CGSize = .zero

    var body: some View {
        VStack(spacing: 16) {
            detectionModel.previewImage?.image?
                .resizable()
                .scaledToFit()
                .overlay(content: {
                    GeometryReader { geometry in
                        DispatchQueue.main.async {
                            self.imageSize = geometry.size
                        }
                        return Color.clear
                    }
                })
                .overlay(content: {
                    ForEach(0..<detectionModel.observations.count, id: \.self) { index in
                        let observation = detectionModel.observations[index]
                        let (text, confidence, boxSize, boxPosition) = detectionModel.vision.processObservation(observation, for: imageSize)
                        RoundedRectangle(cornerRadius: 8)
                            .stroke(.black, style: .init(lineWidth: 4.0))
                            .overlay(alignment: .topLeading, content: {
                                Text("\(text): \(confidence)")
                                    .background(.white)
                                    .offset(y: -28)
                            })
                            .frame(width: boxSize.width, height: boxSize.height)
                            .position(boxPosition)
                    }
                })

            Button(action: {
                detectionModel.isDetecting.toggle()
            }, label: {
                Text("\(detectionModel.isDetecting ? "Stop" : "Start") Detection")
                    .foregroundStyle(.black)
                    .padding(.all)
                    .background(
                        RoundedRectangle(cornerRadius: 8)
                            .stroke(.black, lineWidth: 2.0))
            })

            Slider(
                value: $sliderValue,
                in: 0...Double(detectionModel.maxFramePerSecond),
                step: 1,
                onEditingChanged: { changing in
                    guard !changing else { return }
                    detectionModel.framePerSecond = Int(sliderValue)
                }
            )

            Text("Vision Processing FPS: \(Int(sliderValue))")
Text("When FPS is 0, processing continuously.")
                .foregroundStyle(.red)
        }
        .task {
            await detectionModel.camera.start()
        }
        .onDisappear {
            detectionModel.camera.stop()
        }
        .frame(maxWidth: .infinity, maxHeight: .infinity, alignment: .top)
        .padding(.all, 32)
    }
}
Pretty much exactly the same as what we had in my previous article, SwiftUI + Vision: Object detection in Still Image, except for a couple of new features.
- Start and stop the camera on appear and disappear.
- Toggle detection on and off.
- A slider for adjusting how many frames Vision processes per second.
Let’s test it out!
I know, as I always say, I am BAD at taking photos/videos! Forgive me for that!
But we have it! Yeah!
Thank you for reading!
That’s all I have for today!
Planning on making some human? object? tracking app next, so stay tuned if you are interested!
Again, you can grab the demo app here.
Happy live capturing!