# Digital design: The hardest thing I've ever done

Keith Bannister - Co-learnium - 17 June 2021



# The ASKAP/CRAFT Coherent upgrade

FRB2020 Keith Bannister - keith.bannister@csiro.au @pleasefftme

With: Xinping Deng, Li Bang On behalf of the CRAFT collaboration




# Localising an FRB/day by shoving 2 million YouTube viewers into a fridge


# ASKAP

- 36 x 12m antennas
- Each antenna: 188 receivers
- Each: 36 beams = $\sim 30\ \text{deg}^2$ per antenna
- Total: 1296 beams = $1000\ \text{deg}^2$ in fly's-eye mode
- Tuning: 0.7-1.8 GHz
- 336 x 1 MHz channels
- Autocorrelations with ~1ms time resolution
- 6km max baseline = 6" synthesised beam at 1.4 GHz
- 7000 Receivers
- 20 000 Lasers
- 15 500km of fibre: Sydney to LA.
- 72 Tbits/sec off samplers = 10% of the internet
- 1.6 MW PV Solar Array enough to power a small village
- 2.5 MWhr Lithium ion battery



# But: It isn't fully armed and operational

- Current processing: incoherent sum, sensitivity $\propto N^{1/2}$
- Proposed method: fast imaging of visibilities, fully coherent, sensitivity $\propto N$
- i.e. 5x more sensitive than the current method (without processing the outer 6 antennas)
- ~0.5-2 FRBs/day, each with ~arcsecond localisation (dependent on logN-logS)
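One way to see where the 5x comes from, as a rough sketch: assume identical antennas, that the current incoherent sum uses all $N_{\rm inc} = 36$ antennas, and that the coherent mode uses the inner $N_{\rm coh} = 30$ (36 minus the outer 6). The symbols $N_{\rm inc}$ and $N_{\rm coh}$ are introduced here for illustration only.

$$
\frac{\text{coherent sensitivity}}{\text{incoherent sensitivity}}
\approx \frac{N_{\rm coh}}{\sqrt{N_{\rm inc}}}
= \frac{30}{\sqrt{36}}
= 5
$$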







Shannon+17 Supplementary







# **ASKAP Fast Imaging**

- Typically we'll use 30 antennas within 2 km diameter each with 288 channels, 1ms integrations.
- Discrete sampling of UV plane = Npix x Npix = 256x256 cell grid.
- The positions of the baselines are essentially static for ~30 seconds (Earth rotates slowly).
- Every millisecond we get visibilities for 30 antennas = 435 baselines x 288 channels ~ 130k measurements
- Many of the channels fall in the same cell in the grid. We'll average those (in a preprocessing step - the FDMT) by a factor of ~25x
- There are only ~6500 non-zero points in the UV plane, i.e. the UV plane is 90% zeros (!)
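The numbers in this list can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope helper, not part of the pipeline: the function names are made up for illustration, and it assumes one visibility per baseline per channel and counts only cross-correlations, n(n-1)/2 baselines for n antennas.

```cpp
#include <cassert>

// Number of cross-correlation baselines for n_ant antennas: n(n-1)/2.
int num_baselines(int n_ant) {
    assert(n_ant >= 2);
    return n_ant * (n_ant - 1) / 2;
}

// Visibilities per integration: one per baseline per channel (assumption).
long visibilities_per_integration(int n_ant, int n_chan) {
    return static_cast<long>(num_baselines(n_ant)) * n_chan;
}

// Fraction of an npix x npix UV grid that holds a non-zero value.
double grid_fill_fraction(int n_occupied, int npix) {
    assert(npix > 0);
    return static_cast<double>(n_occupied) /
           (static_cast<double>(npix) * npix);
}
```

With 30 antennas and 288 channels this gives 435 baselines and 125,280 visibilities per millisecond, and ~6500 occupied cells out of 256 x 256 = 65,536 is a fill fraction of about 10%, which is where the "90% zeros" comes from.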



22 antennas in 1km radius.

## Fast imaging: In a nutshell



Existing hardware

New Hardware

**GPU** tasks

**CPU** tasks

# **Processing Steps**





# GPU vs FPGA Smackdown





|                       | NVIDIA V100 GPU | Xilinx Alveo U280  | Xilinx Alveo U50 |
|-----------------------|-----------------|--------------------|------------------|
| Cost                  | ~\$15k AUD      | ~\$10k AUD         | \$3k AUD         |
| Memory                | 16/32 GB        | 8GB HBM + 32GB DDR | 8GB HBM          |
| "Memory bandwidth"    | 900 GB/sec      | 460 GB/sec         | 316 GB/sec       |
| "Computing"           | "100 TFlops"    | 24.5 Tops (int8)   | (less than U280) |
| L1 Cache              | 96KB x 84 = 8MB | 41 MB(!)           | 28 MB            |
| L1 cache rate         | 24 TB/sec       | 30 TB/sec          | 24 TB/sec        |
| Power                 | 300W            | 225 W              | 75W              |
| Programming           | CUDA :-)        | HLS :-(            | HLS :-(          |





### Designing hardware

- Usually done in VHDL/Verilog, which is thought to be difficult
- New thing: HLS lets you write in a software language = EASY!
- BUT: You're still designing hardware. It turns out that's the thing that's hard. The language is secondary.
- Things you have to think about:
  - Consumption of different types of on-chip resource: LUTs, BRAM, URAM, SRL
  - The way off-chip memory is accessed: data width, burst sizes, read vs write, outstanding reads/writes
  - Routing: how different functions are connected, both the data path *and the control path* (in HLS the control path can be hidden from you a bit).
  - Timing issues: how the architecture is implemented on the chip affects how fast it will run.
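To make the list above concrete, here is a small sketch of how those hardware decisions show up in HLS source. It is not code from the project: `boxcar_sum`, `N` and `NTAPS` are hypothetical, and the exact effect of each pragma depends on the tool version. An ordinary C++ compiler ignores the pragmas, so the function can still be run in software.

```cpp
#include <cassert>
#include <cstdint>

constexpr int N = 1024;    // samples per call (illustrative value)
constexpr int NTAPS = 4;   // boxcar width (illustrative value)

// Hypothetical kernel: running sum over the last NTAPS input samples.
void boxcar_sum(const int32_t* in, int32_t* out) {
// Off-chip memory access: each pointer becomes an AXI master port,
// and the burst length is a hardware parameter, not a runtime choice.
#pragma HLS INTERFACE m_axi port=in  bundle=gmem0 max_read_burst_length=16
#pragma HLS INTERFACE m_axi port=out bundle=gmem1 max_write_burst_length=16
    assert(in != nullptr && out != nullptr);
    int32_t window[NTAPS] = {};
// On-chip resources: fully partitioning the array asks for registers
// instead of a block RAM, so every element can be read in one cycle.
#pragma HLS ARRAY_PARTITION variable=window complete dim=1
sample_loop:
    for (int i = 0; i < N; i++) {
// Timing: request one new sample per clock; whether the tool achieves
// that depends on the logic and routing between registers.
#pragma HLS PIPELINE II=1
        int32_t sum = 0;
        // Shift the window by one and accumulate the older samples.
        for (int t = NTAPS - 1; t > 0; t--) {
            window[t] = window[t - 1];
            sum += window[t];
        }
        window[0] = in[i];
        out[i] = sum + window[0];
    }
}
```

The C++ is trivial; the three pragmas are where the hardware design actually happens, which is the point of the "language is secondary" bullet.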



### Challenges we've had

- Not being digital engineers and not realising it's a problem
- Massive learning curve
- First project and pushing the limits of what's possible.
- Bugs/limitations in the tools
- Working around bugs/limitations in the hardware



### Block diagram generated by the tool



### Placement on the actual chip

- `fdmt_tunable_c32_1` (`pfm_dynamic_fdmt_tu…`)
- `fft2d_1` (`pfm_dynamic_fft2d_1_0`)
- `fft2d_2` (`pfm_dynamic_fft2d_2_0`)
- `fft2d_3` (`pfm_dynamic_fft2d_3_0`)
- `fft2d_4` (`pfm_dynamic_fft2d_4_0`)
- `hmss_0` (`pfm_dynamic_hmss_0_0`)
- `init_cal_combine_mss` (`pfm_dynamic_init_ca…`)
- `init_combine_mss` (`pfm_dynamic_init_combin…`)
- `interrupt_concat` (`pfm_dynamic_interrupt_c…`)
- `krnl_boxc_4cu_1` (`pfm_dynamic_krnl_boxc_4…`)
- `krnl_grid_4cu_1` (`pfm_dynamic_krnl_grid_4cu…`)
- `krnl_grid_4cu_2` (`pfm_dynamic_krnl_grid_4cu…`)
- `krnl_grid_4cu_3` (`pfm_dynamic_krnl_grid_4cu…`)
- `krnl_grid_4cu_4` (`pfm_dynamic_krnl_grid_4cu…`)
- `krnl_sync_stream_uv_4cu_1` (`pfm_dynamic_k…`)






### **Debugging waveforms**



- See git 8b4136a3c6b501384f72a622b251b4692820f234

[Figure: simulation waveform screenshot, cursor at 365.974 ns]

Module hierarchy:

| Name                   | Issue Type | Slack | BRAM | DSP | FF    | LUT  | Latency | Interval | Pipelined |
|------------------------|------------|-------|------|-----|-------|------|---------|----------|-----------|
|                        |            |       | 30   | 0   | 12297 | 9071 | 337     | 338      | no        |
|                        |            |       | 0    | 0   | 10649 | 7272 | 82      | 4        | dataflow  |
| grid_process_and_write | II         | -     | 0    | 0   | 2567  | 3146 | 7       | 4        | yes       |
| grid_read              | II         | -     | 0    | 0   | 3666  | 728  | 74      | 4        | yes       |


### **Trying to meet timing of grid kernel**

Control logic in large Vitis kernels is slowing the design. Might change to smaller kernels.



Path 1 - `bd_0_wrapper_timing_summary_routed`

| Name             | Path 1                                                          |
|------------------|-----------------------------------------------------------------|
| Slack            | -0.855 ns                                                       |
| Source           | `bd_0_i/hls_inst/inst/ap_cs_fsm_reg[1]_replica_1/C` (rising edge…) |
| Destination      | `bd_0_i/hls_inst/inst/grp_streaming_grid_6cu_idm_fu_376/dataflo…` |
| Path Group       | ap_clk                                                          |
| Path Type        | Setup (Max at Slow Process Corner)                              |
| Requirement      | 2.500 ns (ap_clk rise@2.500ns - ap_clk rise@0.000ns)            |
| Data Path Delay  | 2.987 ns (logic 0.364 ns (12.186%), route 2.623 ns (87.814%))   |
| Logic Levels     | 6 (LUT2=1 LUT5=1 LUT6=4)                                        |
| Clock Path Skew  | 0.009 ns                                                        |
| Clock Uncertainty | 0.035 ns                                                       |

Source clock path:

| Delay Type               | Incr (ns) | Path (ns) | Location            | Netlist Resource |
|--------------------------|-----------|-----------|---------------------|------------------|
| (clock ap_clk rise edge) | (r) 0.000 | 0.000     |                     |                  |
|                          | (r) 0.000 | 0.000     |                     | ap_clk           |
| net (fo=150453, unset)   | 0.030     | 0.030     |                     | bd_0_i…          |
| FDRE                     |           |           | Site: SLICE_X114Y143 | bd_0_i…         |

Data path:

| Delay Type                     | Edge | Incr (ns) | Path (ns) | Location             |
|--------------------------------|------|-----------|-----------|----------------------|
| FDRE (Prop_HFF_SLICEM_C_Q)     | (f)  | 0.079     | 0.109     | Site: SLICE_X114Y143 |
| net (fo=2, routed)             |      | 0.758     | 0.867     |                      |
| LUT6 (Prop_A6LUT_SLICEM_I4_O)  | (r)  | 0.038     | 0.905     | Site: SLICE_X95Y183  |
| net (fo=5, routed)             |      | 0.147     | 1.052     |                      |
| LUT6 (Prop_C6LUT_SLICEM_…)     | (f)  | 0.037     | 1.089     | Site: SLICE_X95Y187  |
| net (fo=11, routed)            |      | 0.453     | 1.542     |                      |
| LUT2 (Prop_F6LUT_SLICEL_…)     | (r)  | 0.035     | 1.577     | Site: SLICE_X112Y188 |
| net (fo=1, routed)             |      | 0.039     | 1.616     |                      |
| LUT6 (Prop_H6LUT_SLICEL_I2_O)  | (r)  | 0.036     | 1.652     | Site: SLICE_X112Y188 |
| net (fo=10, routed)            |      | 0.237     | 1.889     |                      |
| LUT6 (Prop_A6LUT_SLICEL_I1_O)  | (r)  | 0.050     | 1.939     | Site: SLICE_X113Y190 |
| net (fo=1, routed)             |      | 0.141     | 2.080     |                      |
| LUT5 (Prop_B6LUT_SLICEM_I4_O)  | (r)  | 0.089     | 2.169     | Site: SLICE_X114Y192 |
| net (fo=9, routed)             |      | 0.848     | 3.017     |                      |
| RAMB36E2                       |      |           |           | Site: RAMB36_X8Y39   |
| Arrival Time                   |      |           | 3.017     |                      |
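Two numbers are worth pulling out of a report like this: the clock rate the failing path could actually support, and how much of the delay is routing rather than logic. The helpers below are a rough sketch with made-up names; they assume the path delay would stay the same if the clock constraint were relaxed, which is only approximately true after re-running place and route.

```cpp
#include <cassert>

// Clock (MHz) at which a path with this setup slack would just meet timing,
// assuming the path delay does not change when the constraint changes.
double fmax_mhz(double period_ns, double slack_ns) {
    assert(period_ns - slack_ns > 0.0);
    return 1000.0 / (period_ns - slack_ns);
}

// Fraction of the data path delay spent in routing rather than logic.
double route_fraction(double logic_ns, double route_ns) {
    assert(logic_ns + route_ns > 0.0);
    return route_ns / (logic_ns + route_ns);
}
```

For the path above, a 2.5 ns requirement with -0.855 ns slack corresponds to roughly 298 MHz rather than the 400 MHz target, and 2.623 ns of the 2.987 ns data path (about 88%) is routing. The six LUT levels are cheap; the wires between them are the problem.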

### **Example code - and hardware block**

```
template <int iterno>
void fdmt_process_iteration_ctloop(
    fdmt_stream& in_stream,
    fdmt_stream& out_stream,
    fdmt_config_entry_t config_table[NCIN/2],
    FdmtRingFifos<iterno>& fifos)
{
    constexpr int ncout = MyConfig::nchan_out_for_iter(iterno);
    // Need to mark these stable so do_iteration() can overlap with stream into buff…
#pragma HLS STABLE variable=config_table
#pragma HLS STABLE variable=fifos
ct_loop:
    for (int ct = 0; ct < ncout*NT; ct++) {
#pragma HLS DATAFLOW // DATAFLOW with PIPO
        // Double buffer shared between the two dataflow stages below
        fdmt_complex_t buffer[2][MyConfig::ndm_in_for_iter(iterno)];
#pragma HLS STREAM variable=buffer depth=2 off
        fdmt_process_stream_into_buffer<iterno>(buffer, in_stream);
        fdmt_process_do_iteration<iterno>(ct,
                buffer,
                out_stream,
                config_table,
                fifos);
    }
}
```

FIFOS are marked as STABLE in code


Summary:

| RTL Ports                               | Dir | Bits | Protocol   | Source Object                  | C Type       |
|-----------------------------------------|-----|------|------------|--------------------------------|--------------|
| ap_clk                                  | in  | 1    | ap_ctrl_hs | `fdmt_process_do_iteration<0>` | return value |
| ap_rst                                  | in  | 1    | ap_ctrl_hs | `fdmt_process_do_iteration<0>` | return value |
| ap_start                                | in  | 1    | ap_ctrl_hs | `fdmt_process_do_iteration<0>` | return value |
| ap_done                                 | out | 1    | ap_ctrl_hs | `fdmt_process_do_iteration<0>` | return value |
| ap_continue                             | in  | 1    | ap_ctrl_hs | `fdmt_process_do_iteration<0>` | return value |
| ap_idle                                 | out | 1    | ap_ctrl_hs | `fdmt_process_do_iteration<0>` | return value |
| ap_ready                                | out | 1    | ap_ctrl_hs | `fdmt_process_do_iteration<0>` | return value |
| s1_V_din                                | out | 32   | ap_fifo    | s1_V                           | pointer      |
| s1_V_full_n                             | in  | 1    | ap_fifo    | s1_V                           | pointer      |
| s1_V_write                              | out | 1    | ap_fifo    | s1_V                           | pointer      |
| configs_address0                        | out | 4    | ap_stable  | configs                        | array        |
| configs_ce0                             | out | 1    | ap_stable  | configs                        | array        |
| configs_q0                              | in  | 32   | ap_stable  | configs                        | array        |
| ct_dout                                 | in  | 10   | ap_fifo    | ct                             | pointer      |
| ct_empty_n                              | in  | 1    | ap_fifo    | ct                             | pointer      |
| ct_read                                 | out | 1    | ap_fifo    | ct                             | pointer      |
| buffer_r_address0                       | out | 3    | ap_memory  | buffer_r                       | array        |
| buffer_r_ce0                            | out | 1    | ap_memory  | buffer_r                       | array        |
| buffer_r_q0                             | in  |      | ap_memory  | buffer_r                       | array        |
| buffer_r_address1                       | out | 3    | ap_memory  | buffer_r                       | array        |
| buffer_r_ce1                            | out | 1    | ap_memory  | buffer_r                       | array        |
| buffer_r_q1                             | in  |      | ap_memory  | buffer_r                       | array        |
| fifos0_m_process_buffer_data_V_address0 | out | 8    | ap_memory  | fifos0_m_process_buffer_data_V | array        |
| fifos0_m_process_buffer_data_V_ce0      | out | 1    | ap_memory  | fifos0_m_process_buffer_data_V | array        |
| fifos0_m_process_buffer_data_V_we0      | out |      | ap_memory  | fifos0_m_process_buffer_data_V | array        |
| fifos0_m_process_buffer_data_V_d0       | out | 32   | ap_memory  | fifos0_m_process_buffer_data_V | array        |
| fifos0_m_process_buffer_data_V_address1 | out | 8    | ap_memory  | fifos0_m_process_buffer_data_V | array        |
| fifos0_m_process_buffer_data_V_ce1      | out | 1    | ap_memory  | fifos0_m_process_buffer_data_V | array        |
| fifos0_m_process_buffer_data_V_q1       | in  | 32   | ap_memory  | fifos0_m_process_buffer_data_V | array        |

But in the synthesis report the FIFOS are ap\_memory (unlike the configuration table)

If I do a different build with a different top function



### Synthesis report

General information:

| Field          | Value                                  |
|----------------|----------------------------------------|
| Date           | Wed Jun 16 10:20:56 2021               |
| Version        | … (… Nov 18 09:12:47 MST 2020)         |
| Project        | fdmt_tunable                           |
| Solution       | solution1 (Vitis Kernel Flow Target)   |
| Product family | vi…                                    |
| Target device  | xcu280-fsvh2892-2L-e                   |

Timing estimate:

| Target  | Estimated | Uncertainty |
|---------|-----------|-------------|
| 2.50 ns | 3.557 ns  | 0.50 ns     |

Performance & resource estimates:

| Modules & Loops               | Issue Type       | Slack | Latency (cycles) | Latency (ns) | Interval | Pipelined | BRAM | DSP | FF    | LUT | URAM |
|-------------------------------|------------------|-------|------------------|--------------|----------|-----------|------|-----|-------|-----|------|
| fdmt_tunable_c32              |                  | -1.56 | -                | -            | -        |           | 54   |     | 23128 |     | 8    |
| fdmt_run_with_config          |                  | -1.56 | -                | -            | -        | dataflow  | 4    |     | 18263 |     | 8    |
| fdmt_process_nbank            |                  | -     | -                | -            | -        |           | 4    |     | 15035 |     |      |
| fdmt_read                     |                  | -     | 6217             | 1.554e4      | 6217     | no        | 0    |     | 1132  | 909 |      |
| fdmt_write                    | Timing Violation | -1.56 | 6218             | 2.212e4      | 6218     | no        | 0    | 0   | 521   | 777 |      |
| fdmt_read_history             |                  | -     | 8217             | 2.054e4      | 8217     |           | 0    | 0   | 358   | 601 | 0    |
| fdmt_write_history            |                  | -     | 8729             | 2.182e4      | 8729     |           | 0    | 0   | 386   | 387 | 0    |
| fdmt_run_with_config_entry290 |                  | -     | 0                | 0.0          | 0        |           | 0    | 0   | 3     | 110 | 0    |
| fdmt_get_config_from_hbm296   |                  | -     | 320              | 800.000      | 320      | no        | 0    | 0   | 150   | 508 | 0    |

HW interfaces (M_AXI):

| Interface   | Data Width (SW->HW) | Address Width | Latency | Offset | Offset Interfaces | Register | Max Widen Bitwidth | Max Read Burst Length | Max Write Burst Length |
|-------------|---------------------|---------------|---------|--------|-------------------|----------|--------------------|-----------------------|------------------------|
| m_axi_gmem0 | 64 -> 512           | 64            | 64      | slave  | s_axi_control     | 0        | 512                | 16                    | 16                     |
| m_axi_gmem1 | 64 -> 64            | 64            | 64      | slave  | s_axi_control     | 0        | 512                | 16                    | 16                     |
| m_axi_gmem2 | 64 -> 128           | 64            | 64      | slave  | s_axi_control     | 0        | 512                | 16                    | 16                     |
| m_axi_gmem3 | 64 -> 64            | 64            | 64      | slave  | s_axi_control     | 0        | 512                | 16                    | 16                     |
| m_axi_gmem4 | 64 -> 64            | 64            | 64      | slave  | s_axi_control     | 0        | 512                | 16                    | 16                     |
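The interface widths in this report matter because they cap what each memory port can move. As a rough sketch (helper names are made up, and it assumes one data beat per clock with no stalls, which real designs do not reach), the peak rate of one AXI port and the size of one burst can be estimated like this:

```cpp
#include <cassert>

// Bytes moved by one burst: bytes per beat times beats per burst.
int burst_bytes(int width_bits, int burst_len) {
    assert(width_bits % 8 == 0 && burst_len > 0);
    return (width_bits / 8) * burst_len;
}

// Upper bound on one port's throughput in GB/s (1 GB = 1e9 bytes),
// assuming one beat is transferred on every clock cycle.
double peak_gbytes_per_sec(int width_bits, double clock_mhz) {
    assert(width_bits % 8 == 0 && clock_mhz > 0.0);
    return (width_bits / 8) * clock_mhz * 1e6 / 1e9;
}

// Smallest number of such ports whose combined peak reaches the target.
int ports_needed(double target_gbytes_per_sec, int width_bits, double clock_mhz) {
    const double per_port = peak_gbytes_per_sec(width_bits, clock_mhz);
    int n = 1;
    while (n * per_port < target_gbytes_per_sec) {
        n++;
    }
    return n;
}
```

A 512-bit port at the 400 MHz target (2.50 ns) peaks at 25.6 GB/s and a 16-beat burst is 1024 bytes, whereas a port left at 64 bits peaks at 3.2 GB/s. On these idealised numbers it would take about 18 full-width ports to approach the 460 GB/sec quoted for the U280 in the comparison table.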

### Depth of modules

| Modules & Loops                               | Issue Type       | Slack | Latency (cycles) | Latency (ns) | Iteration Latency | Interval | Trip Count |
|-----------------------------------------------|------------------|-------|------------------|--------------|-------------------|----------|------------|
| fdmt_run_with_config                          |                  | -1.56 | -                | -            | -                 | -        | -          |
| process_in_loop                               |                  | -     | -                | -            | -                 | -        | -          |
| fdmt_iteration_0_216                          |                  | -     | -                | -            | -                 | -        | -          |
| fdmt_process_iteration_0_s                    |                  | -     | 2379             | 5.947e3      | -                 | 2379     | -          |
| fdmt_process_iteration_tloop_allchan_0_s      |                  | -     | 2378             | 5.945e3      | -                 | 2378     | -          |
|                                               |                  | -     | 168              | 420.000      | -                 | 96       | -          |
| fdmt_process_do_iteration_allchan_0_s         |                  | -     | 70               | 175.000      |                   | 64       | -          |
| shift_4                                       |                  | -     | 0                | 0.0          | -                 | 1        | -          |
| cout_loop_d_loop                              |                  | -     | 69               | 172.000      | 7                 | 1        | 64         |
| fdmt_process_stream_into_buffer_allchan_0_s   |                  | -     | 97               | 242.000      |                   | 96       | -          |
| c_loop_d_loop                                 |                  | -     | 96               | 240.000      | 2                 | 1        | 96         |
| t_loop                                        |                  | -     | 2377             | 5.942e3      | 2377              | -        | 24         |
| fdmt_load_fifos_0_s                           |                  |       |                  |              |                   |          |            |
| fdmt_iteration_2_218                          |                  | -     | -                | -            | -                 | -        | -          |
| fdmt_iteration_3_219                          |                  | -     | -                | -            | -                 | -        | -          |
| fdmt_iteration_4_220                          |                  |       |                  |              |                   |          |            |
| fdmt_iteration_1_217                          |                  | -     | -                | -            | -                 | -        | -          |
| fdmt_read_and_init_mux                        |                  | -     | 3370             | 8.425e3      | -                 | 3370     | -          |
| fdmt_transpose                                |                  | -     | 1022             | 2.555e3      |                   | 1022     |            |
| fdmt_read_config                              |                  | -     | 96               | 240.000      | -                 | 96       | -          |
| urest_loop                                    |                  |       |                  |              |                   |          | 8          |
| fdmt_read                                     |                  | -     | 6217             | 1.554e4      | -                 | 6217     | -          |
| urest_loop_t_loop_c_loop                      |                  | -     | 6145             | 1.536e4      | 3                 | 1        | 6144       |
| fdmt_write                                    | Timing Violation | -1.56 | 6218             | 2.212e4      |                   | 6218     |            |
| fdmt_write_ntout                              |                  | -     | 74               | 185.000      | -                 | 3        | -          |
| urest_loop_tblk_loop_d_loop                   | Timing Violation | -     | 6216             | 2.211e4      | 76                | 3        | 2048       |
| fdmt_read_history                             |                  | -     | 8217             | 2.054e4      |                   | 8217     |            |
| urest_loop_h_loop                             |                  | -     | 8145             |              | 3                 |          | 8144       |
| fdmt_write_history                            |                  | -     | 8729             | 2.182e4      | -                 | 8729     | -          |
| fdmt_run_with_config_entry290                 |                  | -     | 0                | 0.0          |                   | 0        | -          |
| fdmt_get_config_from_hbm296                   |                  | -     | 320              | 800.000      | -                 | 320      |            |

### **Dataflow View**





### **Design rationale: Imaging pipeline**

- Original goal: 1 Million FFTs/second
- Xilinx-supplied FFTs
- Block floating-point FFT2D (March 2020): SSR=8.
- Fixed-point FFT2D (Oct 2020):
  - Fmax=400 MHz, SSR=16, 16% of an SLR's LUTs per CU
- Requirement can be achieved with 6 CU @ 350 MHz.
- At 3 CU/SLR, this leaves ~50% of SLR0 and SLR1 for grid/boxcar kernels.
- The FDMT is by itself on SLR2.

FFT Compute units required to achieve 1 M FFTs/sec

| NCU | LUTs (% of SLR)       | Fmax at 100% efficiency (MHz)       |
|-----|-----------------------|-------------------------------------|
| 1   | 16                    | 2048                                |
| 2   | 32                    | 1024                                |
| 3   | 48                    | 682                                 |
| 4   | 64                    | 512                                 |
| 5   | 80                    | 410                                 |
| 6   | 96                    | 341                                 |
| 8   | 128                   | 256                                 |
| 10  | 160                   | 204                                 |
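The table follows a simple scaling rule, sketched below. The constants are inferred from the first row rather than stated in the source: one compute unit needing 2048 MHz for 1 M FFTs/sec implies 2048 clock cycles per FFT per compute unit, and each compute unit costs 16% of an SLR's LUTs. Function names are made up for illustration.

```cpp
#include <cassert>

// Inferred from the NCU=1 row (assumption): cycles one CU spends per FFT,
// and LUT cost of one CU as a percentage of a single SLR.
constexpr double CYCLES_PER_FFT = 2048.0;
constexpr double LUT_PERCENT_PER_CU = 16.0;

// Clock (MHz) each CU must run at, with 100% efficiency, to hit the FFT rate.
double required_clock_mhz(int ncu, double ffts_per_sec) {
    assert(ncu > 0);
    return CYCLES_PER_FFT * ffts_per_sec / ncu / 1e6;
}

// LUTs used by ncu compute units, as a percentage of one SLR.
double lut_percent_of_slr(int ncu) {
    assert(ncu >= 0);
    return LUT_PERCENT_PER_CU * ncu;
}

// Smallest number of CUs that meets the FFT rate at an achievable clock.
int min_compute_units(double ffts_per_sec, double achievable_mhz) {
    assert(achievable_mhz > 0.0);
    int ncu = 1;
    while (required_clock_mhz(ncu, ffts_per_sec) > achievable_mhz) {
        ncu++;
    }
    return ncu;
}
```

At 1 M FFTs/sec and an achievable 350 MHz this gives 6 compute units (each needing about 341 MHz), and 3 compute units in one SLR take 48% of its LUTs, matching the bullets above.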



# More block diagrams we tried



### **Original grid - slow (250 MHz)**

Unroll the NFFT in the data path. Makes very wide busses. (Looking at it now, I'm a bit embarrassed we tried this)



Port

Kernel



### Speed now limited by the control path (i.e. ap\_start from the whole kernel needs to get to all the buffers in 1 clk). Produces all FFT inputs in one kernel.



Unroll NFFT as 6 independent Dataflow processes. They are independent after the buffers. fs=340 MHz

### Proposed: Subdivided grid kernel - improved timing through duplicating control path?

Have one read & transpose kernel that feeds 6 buffer and pad kernels. The tricky bit is you have to split the metadata too. But maybe this is a faster way to do things than a big data flow.



