**********
Disclaimer: We are using BETA Features!!!
**********
This article assumes you have some basic understanding of
- using AVFoundation for camera processing, and
- using Vision for detection in still images.
If you need a refresher on those, feel free to check out some of my previous articles:
If you are ready, let’s see how we can use the Vision framework to detect objects in live capture.
I will build my example around text detection, but you can easily change it (or add more to it) by using other VisionRequests such as DetectBarcodesRequest, since all of them are performed in the same way!
You can find the entire demo app on GitHub. Please feel free to grab it and play around!
Set Up
As always, when we need to access the user’s camera, we need to add the NSCameraUsageDescription key to our Info.plist.
Camera For Preview
Before we move on to the main Vision part, let’s first set up our camera for previewing.
import SwiftUI
import AVFoundation

class CameraManager: NSObject {
    private let captureSession = AVCaptureSession()
    private var isCaptureSessionConfigured = false
    private var deviceInput: AVCaptureDeviceInput?

    // for preview
    private var videoOutput: AVCaptureVideoDataOutput?
    private var sessionQueue: DispatchQueue!

    // for preview device output
    var isPreviewPaused = false
    private var addToPreviewStream: ((CIImage) -> Void)?

    lazy var previewStream: AsyncStream<CIImage> = {
        AsyncStream { continuation in
            addToPreviewStream = { ciImage in
                if !self.isPreviewPaused {
                    continuation.yield(ciImage)
                }
            }
        }
    }()

    override init() {
        super.init()
        // The value of this property is an AVCaptureSessionPreset indicating the current session preset in use by the receiver. The sessionPreset property may be set while the receiver is running.
        captureSession.sessionPreset = .low
        sessionQueue = DispatchQueue(label: "session queue")
    }

    func start() async {
        let authorized = await checkAuthorization()
        guard authorized else {
            print("Camera access was not authorized.")
            return
        }

        if isCaptureSessionConfigured {
            if !captureSession.isRunning {
                sessionQueue.async { [self] in
                    self.captureSession.startRunning()
                }
            }
            return
        }

        sessionQueue.async { [self] in
            self.configureCaptureSession { success in
                guard success else { return }
                self.captureSession.startRunning()
            }
        }
    }

    func stop() {
        guard isCaptureSessionConfigured else { return }
        if captureSession.isRunning {
            sessionQueue.async {
                self.captureSession.stopRunning()
            }
        }
    }

    private func configureCaptureSession(completionHandler: (_ success: Bool) -> Void) {
        var success = false
        self.captureSession.beginConfiguration()
        defer {
            self.captureSession.commitConfiguration()
            completionHandler(success)
        }

        guard
            let captureDevice = AVCaptureDevice.DiscoverySession(deviceTypes: [.builtInWideAngleCamera], mediaType: .video, position: .back).devices.first,
            let deviceInput = try? AVCaptureDeviceInput(device: captureDevice)
        else {
            print("Failed to obtain video input.")
            return
        }

        captureSession.sessionPreset = AVCaptureSession.Preset.vga640x480

        let videoOutput = AVCaptureVideoDataOutput()
        videoOutput.alwaysDiscardsLateVideoFrames = true
        videoOutput.videoSettings = [kCVPixelBufferPixelFormatTypeKey as String: Int(kCVPixelFormatType_420YpCbCr8BiPlanarFullRange)]
        videoOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "VideoDataOutputQueue"))
        guard captureSession.canAddOutput(videoOutput) else {
            print("Unable to add video output to capture session.")
            return
        }

        captureSession.addInput(deviceInput)
        captureSession.addOutput(videoOutput)

        // Configure the connection only after the output has been added to the session;
        // connection(with:) returns nil before that point.
        if let videoOutputConnection = videoOutput.connection(with: .video) {
            if videoOutputConnection.isVideoMirroringSupported {
                videoOutputConnection.isVideoMirrored = false
            }
        }

        self.deviceInput = deviceInput
        self.videoOutput = videoOutput

        isCaptureSessionConfigured = true
        success = true
    }

    private func checkAuthorization() async -> Bool {
        switch AVCaptureDevice.authorizationStatus(for: .video) {
        case .authorized:
            print("Camera access authorized.")
            return true
        case .notDetermined:
            print("Camera access not determined.")
            sessionQueue.suspend()
            let status = await AVCaptureDevice.requestAccess(for: .video)
            sessionQueue.resume()
            return status
        case .denied:
            print("Camera access denied.")
            return false
        case .restricted:
            print("Camera library access restricted.")
            return false
        default:
            return false
        }
    }
}

extension CameraManager: AVCaptureVideoDataOutputSampleBufferDelegate {
    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        guard let pixelBuffer = sampleBuffer.imageBuffer else { return }
        connection.videoRotationAngle = RotationAngle.portrait.rawValue
        addToPreviewStream?(CIImage(cvPixelBuffer: pixelBuffer))
    }

    func captureOutput(_ output: AVCaptureOutput, didDrop sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        // print("frame dropped")
    }
}

private enum RotationAngle: CGFloat {
    case portrait = 90
    case portraitUpsideDown = 270
    case landscapeRight = 180
    case landscapeLeft = 0
}

extension CIImage {
    var image: Image? {
        let ciContext = CIContext()
        guard let cgImage = ciContext.createCGImage(self, from: self.extent) else { return nil }
        return Image(decorative: cgImage, scale: 1, orientation: .up)
    }
}
All we need to know from the code above is the following (a quick usage sketch follows the list).
- The camera preview will be available through previewStream: AsyncStream<CIImage>.
- To start the camera, call start().
- To stop the camera, call stop().
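For example, pulling frames off the stream looks roughly like this. This is a minimal sketch, not part of the demo app (the real consumer is the DetectionModel we will build later), and consumePreview is just a hypothetical helper name:
// A minimal usage sketch; the real consumer is the DetectionModel built later.
func consumePreview(from camera: CameraManager) async {
    await camera.start()                        // requests permission (if needed) and starts the session
    for await ciImage in camera.previewStream { // each element is the latest camera frame as a CIImage
        // update your UI or hand `ciImage` to Vision here
        print("got a frame: \(ciImage.extent)")
    }
}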
If you do want a little more insight, here are a couple of notes I would like to make!
One!
Make sure to specify the pixel format for videoOutput!
videoOutput.videoSettings = [kCVPixelBufferPixelFormatTypeKey as String: Int(kCVPixelFormatType_420YpCbCr8BiPlanarFullRange)]
Two!
Also, I have set videoOutput.alwaysDiscardsLateVideoFrames = true explicitly. It defaults to true, but I do want to point it out!
Vision requests are really expensive and somewhat time-consuming. The camera will stop working if the buffer queue overflows the available memory. We want to allow AVFoundation to drop frames when necessary.
If you have taken a look at my previous SwiftUI: Camera App with AVFoundation, you might realize that what we have here is a lot simpler. That’s because we don’t need to switch between capture devices, nor deal with video recording or photo taking!
Vision For Text Detection
Let me first share the code with you and then take a more detailed look at what we have here!
import SwiftUI
import Vision

class VisionManager {
    private var requests: [any VisionRequest] = []
    private var isProcessing: Bool = false

    private var addToObservationStream: (([any VisionObservation]) -> Void)?

    lazy var observationStream: AsyncStream<[any VisionObservation]> = {
        AsyncStream { continuation in
            addToObservationStream = { observations in
                continuation.yield(observations)
            }
        }
    }()

    init() {
        var request = RecognizeTextRequest()
        request.recognitionLanguages = [Locale.Language.init(identifier: "en-US"), Locale.Language.init(identifier: "ja-JP")]
        self.requests = [request]
    }

    // MARK: For Live Capture Processing
    func processLiveDetection(_ ciImage: CIImage) {
        if isProcessing { return }
        isProcessing = true
        Task {
            let imageRequestHandler = ImageRequestHandler(ciImage)
            let results = imageRequestHandler.performAll(self.requests)
            await handleVisionResults(results: results)
            isProcessing = false
        }
    }

    // MARK: Process Results and Observations
    @MainActor
    private func handleVisionResults(results: some AsyncSequence<VisionResult, Never>) async {
        var newObservations: [any VisionObservation] = []
        for await result in results {
            switch result {
            case .recognizeText(_, let observations):
                newObservations.append(contentsOf: observations)
            default:
                return
            }
        }
        addToObservationStream?(newObservations)
    }

    @MainActor
    func processObservation(_ observation: any VisionObservation, for imageSize: CGSize) -> (text: String, confidence: Float, size: CGSize, position: CGPoint) {
        switch observation {
        case is RecognizedTextObservation:
            return processTextObservation(observation as! RecognizedTextObservation, for: imageSize)
        default:
            return ("", .zero, .zero, .zero)
        }
    }

    private func processTextObservation(_ observation: RecognizedTextObservation, for imageSize: CGSize) -> (text: String, confidence: Float, size: CGSize, position: CGPoint) {
        let recognizedText = observation.topCandidates(1).first?.string ?? ""
        let confidence = observation.topCandidates(1).first?.confidence ?? 0.0
        let boundingBox = observation.boundingBox

        let converted = boundingBox.toImageCoordinates(imageSize, origin: .upperLeft)
        let position = CGPoint(x: converted.midX, y: converted.midY)

        return (recognizedText, confidence, converted.size, position)
    }
}
Similar to our CameraManager, the observations (i.e., the detection results) will be made available through var observationStream: AsyncStream<[any VisionObservation]>. We will take a more detailed look at how we use these AsyncStreams in the next section.
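Just to show the shape of that stream before we get there, a consumer looks roughly like this (a rough sketch with a hypothetical helper name; the real version is handleVisionObservations in the DetectionModel below):
// Rough sketch of pulling observations off the stream; see DetectionModel below for the real version.
func consumeObservations(from vision: VisionManager) async {
    for await observations in vision.observationStream {
        // One array per processed frame, containing every observation Vision returned for it.
        print("Received \(observations.count) observation(s)")
    }
}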
To process an image, we will call processLiveDetection.
func processLiveDetection(_ ciImage: CIImage) {
    if isProcessing { return }
    isProcessing = true
    Task {
        let imageRequestHandler = ImageRequestHandler(ciImage)
        let results = imageRequestHandler.performAll(self.requests)
        await handleVisionResults(results: results)
        isProcessing = false
    }
}
Two important points I would like to call out here!
- Since I only want to process ONE vision request at a time, I have added an isProcessing flag to check. The reason for that, as I mentioned above, is that vision requests are expensive. I don’t want my app to freeze, and I don’t think it is necessary to try to process every frame!
- Don’t make this function async, as that will block the camera preview. It might not be obvious at this point, but I promise it will become clearer in the next section, where we put together our CameraManager and VisionManager.
Since we are only interested in detecting text, when processing VisionResults and VisionObservations, we simply return for all other cases.
If you have other request types in addition to text, for example barcodes, simply add the corresponding cases to handleVisionResults and processObservation, as sketched below.
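Here is a rough sketch of what that could look like inside VisionManager. Note that the .detectBarcodes case name and the BarcodeObservation type are my reading of the beta API (they mirror .recognizeText and RecognizedTextObservation); double-check them against the SDK version you are building with.
// Sketch only: VisionManager with a barcode request added alongside the text request.
// The `.detectBarcodes` case and `BarcodeObservation` names are assumptions based on the beta API.
init() {
    var textRequest = RecognizeTextRequest()
    textRequest.recognitionLanguages = [Locale.Language.init(identifier: "en-US"), Locale.Language.init(identifier: "ja-JP")]
    self.requests = [textRequest, DetectBarcodesRequest()]
}

@MainActor
private func handleVisionResults(results: some AsyncSequence<VisionResult, Never>) async {
    var newObservations: [any VisionObservation] = []
    for await result in results {
        switch result {
        case .recognizeText(_, let observations):
            newObservations.append(contentsOf: observations)
        case .detectBarcodes(_, let observations):
            // BarcodeObservation also conforms to VisionObservation, so it can share the same array.
            newObservations.append(contentsOf: observations)
        default:
            // With more than one request registered, skip unhandled results instead of returning,
            // so one unknown result type does not drop the whole batch.
            continue
        }
    }
    addToObservationStream?(newObservations)
}
You would also add a matching branch to processObservation that reads the barcode’s bounding box and payload string, the same way processTextObservation does for text.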
We will be using the processObservation function later when adding bounding rectangles and text labels to our preview image.
Detection Model
This is the ObservableObject used by our view; it keeps a reference to both the CameraManager and the VisionManager.
import SwiftUI
import Vision

class DetectionModel: ObservableObject {
    let camera = CameraManager()
    let vision = VisionManager()

    @Published var observations: [any VisionObservation] = []
    @Published var previewImage: CIImage?

    @Published var isDetecting: Bool = false {
        didSet {
            if isDetecting {
                if framePerSecond > 0 {
                    self.timer = Timer.scheduledTimer(timeInterval: 1.0/Double(framePerSecond), target: self, selector: #selector(timerFired), userInfo: nil, repeats: true)
                }
            } else {
                self.observations = []
                timer?.invalidate()
            }
        }
    }
    var framePerSecond = 0 {
        didSet {
            // Clamp to 0...maxFramePerSecond.
            // Assignments inside didSet stick and don't re-trigger the observer;
            // a store inside willSet would be overwritten by the incoming new value.
            if framePerSecond < 0 {
                framePerSecond = 0
            } else if framePerSecond > maxFramePerSecond {
                framePerSecond = maxFramePerSecond
            }
        }
    }
    let maxFramePerSecond = 30
    private var timer: Timer? = nil

    init() {
        Task {
            await handleCameraPreviews()
        }
        Task {
            await handleVisionObservations()
        }
    }

    private func handleCameraPreviews() async {
        for await ciImage in camera.previewStream {
            Task { @MainActor in
                previewImage = ciImage
                // Continuous processing
                if isDetecting, framePerSecond <= 0 {
                    vision.processLiveDetection(ciImage)
                }
            }
        }
    }

    private func handleVisionObservations() async {
        for await observations in vision.observationStream {
            Task { @MainActor in
                if isDetecting {
                    self.observations = observations
                }
            }
        }
    }

    // Processing based on FPS specified
    @objc private func timerFired() {
        guard isDetecting, let ciImage = previewImage else {
            return
        }
        vision.processLiveDetection(ciImage)
    }
}
First of all, we have handleCameraPreviews and handleVisionObservations to pull from the AsyncStreams in each manager class and update the corresponding @Published variables.
I have added a little framePerSecond feature so that we can choose between processing the preview continuously or only when the timer fires. I have set the max to 30 here, but change it as you wish!
If we are detecting (i.e., isDetecting == true) and processing continuously with framePerSecond set to 0, we call vision.processLiveDetection in handleCameraPreviews. Otherwise, it is called from timerFired.
A tiny side note here.
If you try to do this clamping inside willSet instead, you will run into the Attempting to store to property within its own willSet warning. Using explicit self (self.framePerSecond) silences the warning, but the value you store there still gets overwritten by the incoming newValue as soon as willSet returns, so the clamp never takes effect. That is why the clamping lives in didSet above.
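If you want to see that behavior in isolation, here is a tiny playground-style sketch (the Counter class is hypothetical, purely for illustration, and not part of the demo app):
// Playground-style illustration of why the clamp lives in didSet.
class Counter {
    var clampedInWillSet = 0 {
        willSet {
            // Storing here (with or without explicit self) has no lasting effect:
            // the value is overwritten by `newValue` once willSet returns.
            // Without explicit self you also get the "Attempting to store to property
            // within its own willSet" warning.
            if newValue > 10 { self.clampedInWillSet = 10 }
        }
    }

    var clampedInDidSet = 0 {
        didSet {
            // Assignments inside didSet stick and do not re-trigger the observer.
            if clampedInDidSet > 10 { clampedInDidSet = 10 }
        }
    }
}

let counter = Counter()
counter.clampedInWillSet = 99
counter.clampedInDidSet = 99
print(counter.clampedInWillSet, counter.clampedInDidSet) // 99 10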
View
Yes, we are finally here!
Ready to put together our View for live capture and detection!
import SwiftUI

struct LiveDetection: View {
    @StateObject private var detectionModel = DetectionModel()
    @State private var sliderValue: Double = .zero
    @State private var imageSize: CGSize = .zero

    var body: some View {
        VStack(spacing: 16) {
            detectionModel.previewImage?.image?
                .resizable()
                .scaledToFit()
                .overlay(content: {
                    GeometryReader { geometry in
                        DispatchQueue.main.async {
                            self.imageSize = geometry.size
                        }
                        return Color.clear
                    }
                })
                .overlay(content: {
                    ForEach(0..<detectionModel.observations.count, id: \.self) { index in
                        let observation = detectionModel.observations[index]
                        let (text, confidence, boxSize, boxPosition) = detectionModel.vision.processObservation(observation, for: imageSize)
                        RoundedRectangle(cornerRadius: 8)
                            .stroke(.black, style: .init(lineWidth: 4.0))
                            .overlay(alignment: .topLeading, content: {
                                Text("\(text): \(confidence)")
                                    .background(.white)
                                    .offset(y: -28)
                            })
                            .frame(width: boxSize.width, height: boxSize.height)
                            .position(boxPosition)
                    }
                })

            Button(action: {
                detectionModel.isDetecting.toggle()
            }, label: {
                Text("\(detectionModel.isDetecting ? "Stop" : "Start") Detection")
                    .foregroundStyle(.black)
                    .padding(.all)
                    .background(
                        RoundedRectangle(cornerRadius: 8)
                            .stroke(.black, lineWidth: 2.0))
            })

            Slider(
                value: $sliderValue,
                in: 0...Double(detectionModel.maxFramePerSecond),
                step: 1,
                onEditingChanged: { changing in
                    guard !changing else { return }
                    detectionModel.framePerSecond = Int(sliderValue)
                }
            )

            Text("Vision Processing FPS: \(Int(sliderValue))")
Text("When FPS is 0, processing continuously.")
                .foregroundStyle(.red)
        }
        .task {
            await detectionModel.camera.start()
        }
        .onDisappear {
            detectionModel.camera.stop()
        }
        .frame(maxWidth: .infinity, maxHeight: .infinity, alignment: .top)
        .padding(.all, 32)
    }
}
Pretty much exactly the same as what we had in my previous article, SwiftUI + Vision: Object detection in Still Image, except for a couple of new features.
- Start and stop the camera on appear and disappear.
- Toggle detection on and off.
- A slider for adjusting how many frames Vision processes per second.
Let’s test it out!
I know, as I always say, I am BAD at taking photos/videos! Forgive me for that!
But we have it! Yeah!
Thank you for reading!
That’s all I have for today!
Planning on making some human? object? tracking app next, so stay tuned if you are interested!
Again, you can grab the demo app here.
Happy live capturing!