This talk introduces our work to reduce the CPU overhead of the QEMU + librbd stack. First, we changed the QEMU rbd driver to call rbd_aio_writev instead of rbd_aio_write; second, we further optimized rbd_aio_writev to send data with zero copy. Together, these optimizations yield 48% lower CPU cost, 46% lower latency, and 85% higher IOPS for 1M sequential writes. In addition, we improve the scalability of librbd by using multiple writeback threads and by making the rbd cache lock finer-grained.
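
As a rough illustration of why a vectored write can save CPU, the sketch below contrasts a single-buffer write, which must first copy scattered buffers into one contiguous bounce buffer, with a scatter-gather write that passes the buffer list through unchanged. It is only an analogy: it uses the POSIX `writev` call through Python's `os.writev` on a temporary file, not librbd or the QEMU rbd driver, and the helper names (`write_with_copy`, `write_vectored`, `roundtrip`) are hypothetical.

```python
import os
import tempfile


def write_with_copy(fd, chunks):
    """Single-buffer path: join the chunks into one bounce buffer, then write it."""
    bounce = b"".join(chunks)  # extra user-space copy of every byte
    return os.write(fd, bounce)


def write_vectored(fd, chunks):
    """Scatter-gather path: hand the list of buffers to writev without joining them."""
    return os.writev(fd, chunks)


def roundtrip(writer, chunks):
    """Write the chunks to a temporary file with `writer` and return the file contents."""
    with tempfile.TemporaryFile() as f:
        written = writer(f.fileno(), chunks)
        # Small writes to a regular file are expected to complete in one call.
        assert written == sum(len(c) for c in chunks)
        f.seek(0)
        return f.read()


if __name__ == "__main__":
    chunks = [b"guest-", b"io-", b"vector"]
    # Both paths store the same bytes; only the copying before the call differs.
    print(roundtrip(write_with_copy, chunks) == roundtrip(write_vectored, chunks))
```

Both helpers produce identical data on disk; the difference is that the first path copies every byte once more before the write call, which is the kind of overhead a vectored interface is meant to avoid.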