Severe flooding can pose significant risks to human life, cause substantial economic losses, and contribute to environmental problems such as soil salinization. An accurate early flood prediction system can help minimize these losses. Numerical methods have been the dominant approach for predicting flood inundation maps over the past decades. However, high-fidelity two-dimensional numerical methods are typically time-consuming. Machine learning methods have gained popularity in recent years, but generating a flood map directly from a small sample of boundary conditions remains challenging and largely unexplored. In this paper, we develop a machine learning framework capable of directly predicting the maximum flood inundation map from boundary conditions. In our model, time-series boundary conditions are embedded into a higher-dimensional representation and then processed by a transformer encoder. The feature maps produced by the transformer encoder are combined with geophysical information, such as a digital elevation map and a Manning's coefficient map, and then passed to a U-Net to obtain the final prediction. The proposed model achieved high accuracy when tested on historical hurricane events: across all test sets, the mean absolute error is 0.00717 ft and the root mean squared error is 0.03974 ft. Furthermore, we conducted parametric studies on the model architecture and observed that model performance is less sensitive to architectural choices than to the choice of input features. Lastly, we explain why certain geophysical features are necessary to accurately predict flood inundation maps.
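
For reference, the two error metrics quoted above can be computed as in the following minimal sketch. It assumes that errors are averaged over all grid cells of the predicted and reference maximum-depth maps; the abstract does not state the averaging convention, and the function names and toy values are illustrative rather than taken from the paper.

```python
import math


def _cell_errors(pred, ref):
    """Flatten two equally shaped 2-D maps into a list of per-cell errors."""
    return [p - r for prow, rrow in zip(pred, ref) for p, r in zip(prow, rrow)]


def mae(pred, ref):
    """Mean absolute error over all grid cells."""
    errors = _cell_errors(pred, ref)
    return sum(abs(e) for e in errors) / len(errors)


def rmse(pred, ref):
    """Root mean squared error over all grid cells."""
    errors = _cell_errors(pred, ref)
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical 2x3 maximum-depth maps in feet (illustrative values only).
predicted = [[0.0, 0.5, 1.0], [0.0, 0.0, 2.0]]
reference = [[0.0, 0.5, 1.0], [0.0, 0.0, 2.6]]

print(f"MAE  = {mae(predicted, reference):.4f} ft")
print(f"RMSE = {rmse(predicted, reference):.4f} ft")
```

The RMSE is never smaller than the MAE for the same set of errors. An RMSE several times larger than the MAE, as in the reported values, is what one would expect when most cells have near-zero error and the remaining error is concentrated in a small fraction of cells.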