How can I use the GPU to speed up ffmpeg filter processing?

Question

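According to the NVIDIA developer site, you can use the GPU to speed up the rendering of ffmpeg filters:

Create high-performance end-to-end hardware-accelerated video processing, 1:N encoding and 1:N transcoding pipelines using the built-in filters in FFmpeg

Ability to add your own custom high-performance CUDA filters using the shared CUDA context implementation in FFmpeg

The problem I'm facing now is how to use the GPU to accelerate a chain of several ffmpeg filters.

For example: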

ffmpeg -loop 1 -i dog.jpg -filter_complex "scale=iw*4:-1,zoompan=z='zoom+0.002':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':s=720x960" -pix_fmt yuv420p -vcodec libx264 -preset ultrafast -y -r:v 25 -t 5 -crf 28 dog.mp4

Answer 1

林正浩*_*林正浩 11

When it comes to hardware acceleration in FFmpeg, you can expect the following implementations by type:

1. Hardware-accelerated encoders: In the case of NVIDIA, NVENC is supported and implemented via the h264_nvenc and the hevc_nvenc wrappers. See this answer on how to tune them, and any limitations you may run into depending on the generation of hardware you're on.

2. Hardware-accelerated filters: Filters that perform duties such as scaling and post-processing (deinterlacing, etc) are available in FFmpeg, and some implementations are hardware-accelerated. For NVIDIA, the following filters can take advantage of hardware-acceleration:

(a). scale_cuda: This is a scaling filter analogous to the generic scale filter, implemented in CUDA. Its dependency is the ffnvcodec project, whose headers are also needed to enable the NVENC-based encoders. When the ffnvcodec headers are present, the filters that depend on them (scale_cuda and yadif_cuda) are enabled automatically. In production, it may be wise to prefer scale_npp over this filter, as scale_cuda has a very limited set of options.

(b). scale_npp: This is a scaling filter implemented with NVIDIA's Performance Primitives. Its primary dependency is the CUDA SDK, and it must be explicitly enabled by passing the --enable-libnpp, --enable-cuda-nvcc and --enable-nonfree flags to ./configure when building FFmpeg from source. Use this filter in place of scale_cuda wherever possible.

(c). yadif_cuda: This is a deinterlacer, implemented in CUDA. Its dependency, as stated above, is the ffnvcodec package of headers.

(d). All OpenCL-based filters: All NVENC-capable GPUs supported by both the mainline NVIDIA driver and the CUDA SDK implement OpenCL support. I started this section with this clarification because there's news in the wind that NVIDIA will be deprecating mobile Kepler GPUs in their mainline driver, relegating them to Legacy support status. For this reason, if you're on such a platform, take this into consideration.

To enable these filters, pass --enable-opencl to FFmpeg's ./configure script at build time. Note that this requires the OpenCL headers to be present on your system, and can be safely satisfied by your package manager on whatever Linux distribution you're on. On other operating systems, your mileage may vary.
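Before relying on any of these filters, it can help to check how the ffmpeg binary at hand was configured. The following is a minimal sketch, assuming a POSIX shell; has_flag is a hypothetical helper name (not part of FFmpeg), and the list of flags is simply the set named in this answer.

```shell
#!/bin/sh
# has_flag CONFIG FLAG: succeed if FLAG appears as a whole word in CONFIG.
# (has_flag is a hypothetical helper, not part of FFmpeg.)
has_flag() {
    printf '%s\n' "$1" | tr -s '[:space:]' '\n' | grep -qxF -- "$2"
}

if command -v ffmpeg >/dev/null 2>&1; then
    # -buildconf prints the configure options the binary was built with.
    conf=$(ffmpeg -hide_banner -buildconf 2>&1)
    for flag in --enable-opencl --enable-libnpp --enable-cuda-nvcc --enable-nonfree; do
        if has_flag "$conf" "$flag"; then
            echo "$flag: present"
        else
            echo "$flag: missing"
        fi
    done
fi
```

A flag reported as missing means the corresponding filters will not be available until FFmpeg is rebuilt with it.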

To see all OpenCL-based filters, run:

ffmpeg -hide_banner -filters | grep opencl

Notable examples include unsharp_opencl and avgblur_opencl. See this wiki section for more options.

A note on performance with OpenCL filters: take into account the overhead that filter-chain mechanisms such as hwupload and hwdownload introduce into your pipeline. Moving textures between system memory and the accelerator affects performance, and so do format conversions (via the format filter) where they are required. In such cases it may be beneficial to use the hwmap filter and to derive device contexts where applicable. For instance, VAAPI has a mechanism that allows OpenCL device derivation and reverse mapping via hwmap if the cl_intel_va_api_media_sharing OpenCL extension is present. This is typically provided by the Beignet ICD, and is absent in others, such as the newer Neo OpenCL driver.
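As a rough illustration of that overhead, an OpenCL filter applied to frames that start and end in system memory has to be wrapped in hwupload and hwdownload. The sketch below is written under assumptions: the OpenCL device index 0.0, the file names in.mp4 and out.mp4, and the helper name opencl_chain are all hypothetical. Chaining several OpenCL filters inside a single upload/download pair keeps the transfers to one round trip.

```shell
#!/bin/sh
# opencl_chain FILTERS: wrap a comma-separated list of *_opencl filters in
# the upload/download steps needed when frames live in system memory.
# (opencl_chain is a hypothetical helper.)
opencl_chain() {
    printf 'format=yuv420p,hwupload,%s,hwdownload,format=yuv420p' "$1"
}

in=in.mp4      # hypothetical input file
out=out.mp4    # hypothetical output file

if command -v ffmpeg >/dev/null 2>&1 && [ -f "$in" ]; then
    # opencl=gpu:0.0 selects platform 0, device 0; adjust for your system.
    ffmpeg -init_hw_device opencl=gpu:0.0 -filter_hw_device gpu \
        -i "$in" -vf "$(opencl_chain 'unsharp_opencl,avgblur_opencl=3')" \
        -c:v h264_nvenc -c:a copy "$out"
fi
```

Whether this is faster than the CPU equivalents depends on how much work the filters do relative to the cost of the two transfers, so it is worth timing both variants.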

3. Hardware-accelerated decoders (and their associated wrappers): Depending on your input source, and the capabilities of your NVIDIA GPU, based on generation, you may also tap into hardware accelerations based on either CUVID or NVDEC. These methods differ in how they handle textures in-flight on the accelerator, and it is wise to evaluate other factors, such as VRAM utilization, when they are in use. Typically, you can take advantage of the CUVID-based hwaccels for operations such as deinterlacing, if so desired. See their usage via:

ffmpeg -h decoder=h264_cuvid
ffmpeg -h decoder=hevc_cuvid
ffmpeg -h decoder=mpeg2_cuvid

However, beware that handling MBAFF encoded content with these decoders, where double deinterlacing is required, is not advisable as NVIDIA has not yet implemented MBAFF support in the backend. Take a look at this thread for more on the same.

In closing: It is wise to evaluate where and when hardware accelerated offloading (filtering, encoding and decoding) offers an advantage or an acceptable trade-off (in quality, feature support and reliability) in your pipeline prior to deployment in production. This is a vendor-neutral approach when deciding what and when to offload parts of your pipeline, and the same applies to NVIDIA's solutions.

For more information, refer to the hardware acceleration entry in FFmpeg's wiki.
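Applied to the command in the question, none of the filters listed above is a drop-in replacement for zoompan, so the part that can be offloaded most easily is the encode. The sketch below assumes this and keeps scale and zoompan on the CPU; pick_encoder is a hypothetical helper, and the libx264-specific -preset ultrafast and -crf 28 options from the question are dropped because they do not carry over to h264_nvenc.

```shell
#!/bin/sh
# pick_encoder LISTING: print h264_nvenc if it appears in the given
# `ffmpeg -encoders` listing, otherwise fall back to libx264.
# (pick_encoder is a hypothetical helper; being listed only means the build
# includes the encoder, not that a usable GPU is present.)
pick_encoder() {
    if printf '%s\n' "$1" | grep -q 'h264_nvenc'; then
        echo h264_nvenc
    else
        echo libx264
    fi
}

in=dog.jpg
if command -v ffmpeg >/dev/null 2>&1 && [ -f "$in" ]; then
    enc=$(pick_encoder "$(ffmpeg -hide_banner -encoders 2>/dev/null)")
    # scale and zoompan still run on the CPU; only the encode is offloaded.
    ffmpeg -loop 1 -i "$in" -filter_complex \
        "scale=iw*4:-1,zoompan=z='zoom+0.002':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':s=720x960" \
        -pix_fmt yuv420p -c:v "$enc" -r:v 25 -t 5 -y dog.mp4
fi
```

For a five-second clip dominated by zoompan, the gain from the encoder swap alone may be small, which is the kind of trade-off the closing note above asks you to measure.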

Warning: Be sure to lower the decoder's thread count to 1. These hwaccels, particularly cuvid (and the nvdec wrapper), do not implement threading support. In fact, they'll throw warnings at you if the thread count exceeds 16.

Pass -threads 1 to ffmpeg before input. The argument position of threads is important. In this case, it sets the thread count for the decoder to 1. After the input, it sets the thread count used by FFmpeg's encoders and muxers (if threading is supported) to the configured value.
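The placement rule can be sketched as follows. The helper decoder_threads is hypothetical and only inspects an argument list; the file names and the output-side thread count of 4 are arbitrary values chosen for illustration.

```shell
#!/bin/sh
# decoder_threads ARGS...: print the value of the last -threads option that
# appears before the first -i, i.e. the one that applies to that input's
# decoder. Prints an empty line if there is none.
# (decoder_threads is a hypothetical helper.)
decoder_threads() {
    value=
    while [ $# -gt 0 ]; do
        case $1 in
            -i) break ;;
            -threads)
                if [ $# -ge 2 ]; then value=$2; shift; fi ;;
        esac
        shift
    done
    printf '%s\n' "$value"
}

# -threads 1 precedes -i, so it configures the decoder;
# -threads 4 follows the input, so it configures the output side.
set -- -threads 1 -c:v h264_cuvid -i input.mkv -threads 4 -c:v h264_nvenc output.mkv
echo "decoder threads: $(decoder_threads "$@")"
# ffmpeg "$@"   # uncomment to run; input.mkv is a hypothetical file
```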

Samples demonstrating the use of hardware-accelerated filtering, encoding and decoding based on the notes above:

1. Demonstrate the use of 1:N encoding with NVENC:

The following assumption is made: The test-bed only has one NVENC-capable GPU present, a simple GTX 1070. For this reason I'm limited to two simultaneous NVENC sessions, and that is taken into account with the snippets below. Be warned that cases needing to utilize multiple NVENC-capable GPUs will need the command line(s) modified as appropriate.

My sample files are in ~/Desktop/src

I'll be working with a sample file as shown below:

ffprobe -i deint-testfile.mkv -show_format -hide_banner -show_streams

Input #0, matroska,webm, from 'deint-testfile.mkv':
  Metadata:
    encoder         : libebml v1.3.3 + libmatroska v1.4.4
    creation_time   : 2016-03-02T23:20:05.000000Z
  Duration: 00:04:56.97, start: 0.066000, bitrate: 31036 kb/s
    Stream #0:0: Video: h264 (High), yuv420p(tv, bt709, top first), 1920x1080 [SAR 1:1 DAR 16:9], 59.94 fps, 59.94 tbr, 1k tbn, 59.94 tbc (default)
    Metadata:
      BPS             : 29131349
      BPS-eng         : 29131349
      DURATION        : 00:04:56.896000000
      DURATION-eng    : 00:04:56.896000000
      NUMBER_OF_FRAMES: 17598
      NUMBER_OF_FRAMES-eng: 17598
      NUMBER_OF_BYTES : 1081122637
      NUMBER_OF_BYTES-eng: 1081122637
      _STATISTICS_WRITING_APP: mkvmerge v8.9.0 ('Father Daughter') 64bit
      _STATISTICS_WRITING_APP-eng: mkvmerge v8.9.0 ('Father Daughter') 64bit
      _STATISTICS_WRITING_DATE_UTC: 2016-03-02 23:20:05
      _STATISTICS_WRITING_DATE_UTC-eng: 2016-03-02 23:20:05
      _STATISTICS_TAGS: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
      _STATISTICS_TAGS-eng: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
    Stream #0:1: Audio: dts (DTS-HD MA), 48000 Hz, stereo, s32p (24 bit) (default)
    Metadata:
      BPS             : 1907258
      BPS-eng         : 1907258
      DURATION        : 00:04:56.896000000
      DURATION-eng    : 00:04:56.896000000
      NUMBER_OF_FRAMES: 27834
      NUMBER_OF_FRAMES-eng: 27834
      NUMBER_OF_BYTES : 70782196
      NUMBER_OF_BYTES-eng: 70782196
      _STATISTICS_WRITING_APP: mkvmerge v8.9.0 ('Father Daughter') 64bit
      _STATISTICS_WRITING_APP-eng: mkvmerge v8.9.0 ('Father Daughter') 64bit
      _STATISTICS_WRITING_DATE_UTC: 2016-03-02 23:20:05
      _STATISTICS_WRITING_DATE_UTC-eng: 2016-03-02 23:20:05
      _STATISTICS_TAGS: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
      _STATISTICS_TAGS-eng: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
[STREAM]
index=0
codec_name=h264
codec_long_name=H.264 / AVC / MPEG-4 AVC / MPEG-4 part 10
profile=High
codec_type=video
codec_time_base=317/38002
codec_tag_string=[0][0][0][0]
codec_tag=0x0000
width=1920
height=1080
coded_width=1920
coded_height=1088
has_b_frames=1
sample_aspect_ratio=1:1
display_aspect_ratio=16:9
pix_fmt=yuv420p
level=41
color_range=tv
color_space=bt709
color_transfer=bt709
color_primaries=bt709
chroma_location=left
field_order=tt
timecode=N/A
refs=1
is_avc=true
nal_length_size=4
id=N/A
r_frame_rate=19001/317
avg_frame_rate=19001/317
time_base=1/1000
start_pts=66
start_time=0.066000
duration_ts=N/A
duration=N/A
bit_rate=N/A
max_bit_rate=N/A
bits_per_raw_sample=8
nb_frames=N/A
nb_read_frames=N/A
nb_read_packets=N/A
DISPOSITION:default=1
DISPOSITION:dub=0
DISPOSITION:original=0
DISPOSITION:comment=0
DISPOSITION:lyrics=0
DISPOSITION:karaoke=0
DISPOSITION:forced=0
DISPOSITION:hearing_impaired=0
DISPOSITION:visual_impaired=0
DISPOSITION:clean_effects=0
DISPOSITION:attached_pic=0
DISPOSITION:timed_thumbnails=0
TAG:BPS=29131349
TAG:BPS-eng=29131349
TAG:DURATION=00:04:56.896000000
TAG:DURATION-eng=00:04:56.896000000
TAG:NUMBER_OF_FRAMES=17598
TAG:NUMBER_OF_FRAMES-eng=17598
TAG:NUMBER_OF_BYTES=1081122637
TAG:NUMBER_OF_BYTES-eng=1081122637
TAG:_STATISTICS_WRITING_APP=mkvmerge v8.9.0 ('Father Daughter') 64bit
TAG:_STATISTICS_WRITING_APP-eng=mkvmerge v8.9.0 ('Father Daughter') 64bit
TAG:_STATISTICS_WRITING_DATE_UTC=2016-03-02 23:20:05
TAG:_STATISTICS_WRITING_DATE_UTC-eng=2016-03-02 23:20:05
TAG:_STATISTICS_TAGS=BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
TAG:_STATISTICS_TAGS-eng=BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
[/STREAM]
[STREAM]
index=1
codec_name=dts
codec_long_name=DCA (DTS Coherent Acoustics)
profile=DTS-HD MA
codec_type=audio
codec_time_base=1/48000
codec_tag_string=[0][0][0][0]
codec_tag=0x0000
sample_fmt=s32p
sample_rate=48000
channels=2
channel_layout=stereo
bits_per_sample=0
id=N/A
r_frame_rate=0/0
avg_frame_rate=0/0
time_base=1/1000
start_pts=76
start_time=0.076000
duration_ts=N/A
duration=N/A
bit_rate=N/A
max_bit_rate=N/A
bits_per_raw_sample=24
nb_frames=N/A
nb_read_frames=N/A
nb_read_packets=N/A
DISPOSITION:default=1
DISPOSITION:dub=0
DISPOSITION:original=0
DISPOSITION:comment=0
DISPOSITION:lyrics=0
DISPOSITION:karaoke=0
DISPOSITION:forced=0
DISPOSITION:hearing_impaired=0
DISPOSITION:visual_impaired=0
DISPOSITION:clean_effects=0
DISPOSITION:attached_pic=0
DISPOSITION:timed_thumbnails=0
TAG:BPS=1907258
TAG:BPS-eng=1907258
TAG:DURATION=00:04:56.896000000
TAG:DURATION-eng=00:04:56.896000000
TAG:NUMBER_OF_FRAMES=27834
TAG:NUMBER_OF_FRAMES-eng=27834
TAG:NUMBER_OF_BYTES=70782196
TAG:NUMBER_OF_BYTES-eng=70782196
TAG:_STATISTICS_WRITING_APP=mkvmerge v8.9.0 ('Father Daughter') 64bit
TAG:_STATISTICS_WRITING_APP-eng=mkvmerge v8.9.0 ('Father Daughter') 64bit
TAG:_STATISTICS_WRITING_DATE_UTC=2016-03-02 23:20:05
TAG:_STATISTICS_WRITING_DATE_UTC-eng=2016-03-02 23:20:05
TAG:_STATISTICS_TAGS=BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
TAG:_STATISTICS_TAGS-eng=BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
[/STREAM]
[FORMAT]
filename=deint-testfile.mkv
nb_streams=2
nb_programs=0
format_name=matroska,webm
format_long_name=Matroska / WebM
start_time=0.066000
duration=296.972000
size=1152134036
bit_rate=31036839
probe_score=100
TAG:encoder=libebml v1.3.3 + libmatroska v1.4.4
TAG:creation_time=2016-03-02T23:20:05.000000Z
[/FORMAT]

With that information, we can tell that the input file is interlaced (top field first) at 59.94 FPS. In the examples below, I'll target the same frame rate, using a closed GOP, assuming a fixed keyframe distance of 2 seconds (set by -g 120 where -r=60).

I can run this encoder sample as shown, demonstrating two use cases:

Use the cuvid-based decoder (h264_cuvid) as the deinterlacer (note that the input format is H.264/AVC, so h264_cuvid is the matching decoder):

   ffmpeg -threads 1 -fflags +genpts -y -c:v h264_cuvid -surfaces 8 -deint 2 -drop_second_field 1 -hwaccel_output_format cuda \
   -i 'deint-testfile.mkv' -filter_complex \
  "[0:v:0]split=2[a][b]; \
   [a]scale_npp=w=1280:h=720:interp_algo=super[c]; \
   [b]scale_npp=w=640:h=360:interp_algo=super[d]" \
   -bsf:a aac_adtstoasc -af "aresample=async=1000:min_hard_comp=0.100000" -c:a aac -ac 2 -ar 48000 -b:a 128k -vsync 1 \
  -b:v:0 6000k -minrate:v:0 6000k -maxrate:v:0 6000k -bufsize:v:0 400k -c:v:0 h264_nvenc \
  -profile:v:0 high -rc:v:0 cbr_ld_hq -level:v:0 4.2 -r:v:0 59.94 -g:v:0 120 -bf:v:0 3 -strict_gop:v:0 1 \
  -b:v:1 4200k -minrate:v:1 4200k -maxrate:v:1 4200k -bufsize:v:1 280k -c:v:1 h264_nvenc \
  -profile:v:1 high -rc:v:1 cbr_ld_hq -level:v:1 4.2 -r:v:1 59.94 -g:v:1 120 -bf:v:1 3 -strict_gop:v:1 1 \
  -flags +global_header+cgop \
  -map "[c]" -map "[d]" -map a:0 \
  -f tee  \
  "[select=\'v:0,a\':f=flv]"/home/brainiarc7/Desktop/src/cheeks0.flv"| \
   [select=\'v:1,a\':f=flv]"/home/brainiarc7/Desktop/src/cheeks1.flv""

2. Use the nvdec hwaccel paired with the yadif_cuda deinterlacer:

   ffmpeg -threads 1 -fflags +genpts -y -hwaccel nvdec -hwaccel_output_format cuda \
   -i 'deint-testfile.mkv' -filter_complex \
  "[0:v:0]yadif_cuda=0:-1:1,split=2[a][b]; \
   [a]scale_npp=w=1280:h=720:interp_algo=super[c]; \
   [b]scale_npp=w=640:h=360:interp_algo=super[d]" \
   -af "aresample=async=1000:min_hard_comp=0.100000" -c:a aac -ac 2 -ar 48000 -b:a 128k -vsync 1 \
  -b:v:0 6000k -minrate:v:0 6000k -maxrate:v:0 6000k -bufsize:v:0 400k -c:v:0 h264_nvenc \
  -profile:v:0 high -rc:v:0 cbr_ld_hq -level:v:0 4.2 -r:v:0 59.94 -g:v:0 120 -bf:v:0 3 -strict_gop:v:0 1 \
  -b:v:1 4200k -minrate:v:1 4200k -maxrate:v:1 4200k -bufsize:v:1 280k -c:v:1 h264_nvenc \
  -profile:v:1 high -rc:v:1 cbr_ld_hq -level:v:1 4.2 -r:v:1 59.94 -g:v:1 120 -bf:v:1 3 -strict_gop:v:1 1 \
  -flags +global_header+cgop \
  -map "[c]" -map "[d]" -map a:0 \
  -f tee  \
  "[select=\'v:0,a\':f=flv]"/home/brainiarc7/Desktop/src/cheeks0.flv"| \
   [select=\'v:1,a\':f=flv]"/home/brainiarc7/Desktop/src/cheeks1.flv""

You can use an extra filter, hwupload_cuda, before the yadif_cuda deinterlacer in cases where hardware-accelerated decode is undesirable. When you call the hwupload_cuda filter, it automatically creates a device of type cuda, converts all in-flight textures to the cuda format and uploads them to the shared CUDA hardware context on which the following yadif_cuda filter can operate. However, if you pass the option -hwaccel_output_format cuda, you can skip this extra hwupload_cuda filter. This is the preferred method for maximum throughput.

The options specified for the yadif_cuda filter are:

(a). Set the deinterlacing mode to send one frame for each frame.

(b). Set the assumed picture field parity to automatic.

(c). Deinterlace only frames marked as interlaced.

You can confirm this by running:

ffmpeg -h filter=yadif_cuda

You can also attempt double deinterlacing (wherein the deinterlacer sends one frame per field, instead of one frame per frame) by changing the filter options in the nvdec command above to yadif_cuda=1:-1:1.

However, be cautious with this option as it may fail at some specific frame rates. In my testing, using NTSC interlaced content at 29.970 FPS resulted in failure when attempting a double deinterlace. Your mileage may vary.
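The numeric yadif_cuda option triples can be hard to read at a glance, and the hwupload_cuda path mentioned earlier has no example of its own. The sketch below covers both under stated assumptions: yadif_cuda_opts is a hypothetical helper that spells the options out by name (mode, parity, deint) instead of by position, and out-single.mkv is a hypothetical output name.

```shell
#!/bin/sh
# yadif_cuda_opts MODE: print a yadif_cuda filter string using named options.
#   single -> one output frame per input frame (numeric form 0:-1:1)
#   double -> one output frame per field       (numeric form 1:-1:1)
# (yadif_cuda_opts is a hypothetical helper.)
yadif_cuda_opts() {
    case $1 in
        single) mode=send_frame ;;
        double) mode=send_field ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
    printf 'yadif_cuda=mode=%s:parity=auto:deint=interlaced' "$mode"
}

in=deint-testfile.mkv
if command -v ffmpeg >/dev/null 2>&1 && [ -f "$in" ]; then
    # Software decode, then upload frames to the GPU before deinterlacing.
    ffmpeg -i "$in" -vf "hwupload_cuda,$(yadif_cuda_opts single)" \
        -c:v h264_nvenc -c:a copy out-single.mkv
fi
```

Switching the argument to double gives the one-frame-per-field variant, which doubles the output frame rate and is subject to the caveat above.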

3. Demonstrating the use of an OpenCL filter with the NVIDIA GPU:

The filter we will use in this case is tonemap_opencl, with the following usage options:

ffmpeg -h filter=tonemap_opencl

Filter tonemap_opencl
  perform HDR to SDR conversion with tonemapping
    Inputs:
       #0: default (video)
    Outputs:
       #0: default (video)
tonemap_opencl AVOptions:
  tonemap           <int>        ..FV..... tonemap algorithm selection (from 0 to 6) (default none)
     none                         ..FV.....
     linear                       ..FV.....
     gamma                        ..FV.....
     clip                         ..FV.....
     reinhard                     ..FV.....
     hable                        ..FV.....
     mobius                       ..FV.....
  transfer          <int>        ..FV..... set transfer characteristic (from -1 to INT_MAX) (default bt709)
     bt709                        ..FV.....
     bt2020
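As a usage sketch for the options listed above, the chain below tone-maps 10-bit HDR input to 8-bit output. It rests on several assumptions: the OpenCL device index 0.0, the file names, and the helper name tonemap_chain are hypothetical, and the format=nv12 filter option comes from FFmpeg's filter documentation rather than from the (truncated) help output shown here.

```shell
#!/bin/sh
# tonemap_chain ALGO: print an OpenCL filter chain that tone-maps p010 input
# with the given algorithm (e.g. hable, mobius, reinhard) and downloads the
# result as nv12. (tonemap_chain is a hypothetical helper.)
tonemap_chain() {
    printf 'format=p010,hwupload,tonemap_opencl=tonemap=%s:transfer=bt709:format=nv12,hwdownload,format=nv12' "$1"
}

in=hdr-input.mkv   # hypothetical HDR source
if command -v ffmpeg >/dev/null 2>&1 && [ -f "$in" ]; then
    # opencl=gpu:0.0 selects platform 0, device 0; adjust for your system.
    ffmpeg -init_hw_device opencl=gpu:0.0 -filter_hw_device gpu \
        -i "$in" -vf "$(tonemap_chain hable)" \
        -c:v h264_nvenc -c:a copy sdr-output.mkv
fi
```

Depending on the source, the output matrix and primaries may also need to be set on the filter to get correct SDR colors; check ffmpeg -h filter=tonemap_opencl on your build for the full option list.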

Archived: 6 years, 7 months ago
Views: 5240
Last activity: 6 years ago